c.hoover(a)its.utexas.edu wrote:
> Full_Name: Clyde Hoover
> Version: 2.4.8
> OS: Solaris 10 SPARC
> URL:
> Submission from: (NULL) (128.83.93.7)
>
>
> The bug is around line 2015 of servers/slapd/overlays/syncprov.c
>
> } else if ( rs->sr_type == REP_RESULT&& rs->sr_err == LDAP_SUCCESS ) {
> struct berval cookie;
>
> // The cookie is never initialized so if SS_CHANGED is not set
> …
[View More] // this can cause an internal assertion failure downstream from
> // syncprov_done_ctrl() (depending upon what is on the stack)
> // Nulling out the cookie fixes this
> memset((void *)&cookie, 0, sizeof(cookie)); // Add this
> if ( ss->ss_flags& SS_CHANGED ) {
> slap_compose_sync_cookie( op,&cookie, ss->ss_ctxcsn,
> srs->sr_state.rid,
> srs->sr_state.srs->sr_state.sid );
Please provide a stack trace from such a failure. The syncprov_done_ctrl()
function should never be invoked at that point without SS_CHANGED being set,
since in a normal Refresh the search will hit the shortcut and never reach
that code.
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
[View Less]
Full_Name: Clyde Hoover
Version: 2.4.8
OS: Solaris 10 SPARC
URL:
Submission from: (NULL) (128.83.93.7)
The bug is around line 2015 of servers/slapd/overlays/syncprov.c
} else if ( rs->sr_type == REP_RESULT && rs->sr_err == LDAP_SUCCESS ) {
struct berval cookie;
// The cookie is never initialized so if SS_CHANGED is not set
// this can cause an internal assertion failure downstream from
// …
[View More]syncprov_done_ctrl() (depending upon what is on the stack)
// Nulling out the cookie fixes this
memset((void *)&cookie, 0, sizeof(cookie)); // Add this
if ( ss->ss_flags & SS_CHANGED ) {
slap_compose_sync_cookie( op, &cookie, ss->ss_ctxcsn,
srs->sr_state.rid,
srs->sr_state.srs->sr_state.sid );
[View Less]
Rein Tollevik wrote:
> On Wed, 19 Mar 2008, Howard Chu wrote:
>
>>> syncprov_playlog() in syncprov.c generates such invalid cookies when
>>> replaying a
>>> sessionlog where no entries has been deleted. The patch also fixes this
>>> bug.
>> This should never happen, since the cookie is not generated if there were no
>> deletes recorded.
>
> Well, it do happen anyhow. Using the 2.4.8 release this can easily be
> reproduced with a …
[View More]configuration where the syncrepl provider (with
> sessionlogs enabled) is on a glue backend, the consumer replicates the
> entire glue subtree. Assuming both servers are in sync and stopped,
> start the producer, modify an entry in the subordinate db and then start
> the consumer. The producer will send a cookie with an empty csn= value
> as in this sync debug output:
> slapd starting
> slap_queue_csn: queing 0x418001a0 20080320125116.487494Z#000000#000#000000
> slap_graduate_commit_csn: removing 0x8d6ae0 20080320125116.487494Z#000000#000#000000
> srs csn 20080320124421.642686Z#000000#000#000000
> log csn 20080320125116.487494Z#000000#000#000000
> syncprov_playlog: cookie=rid=007,csn=
> syncprov_search_response: cookie=rid=007,csn=20080320125116.487494Z#000000#000#000000
>
> I have also tested it using the cvs head source, and the good news is
> that this was accidentally fixed (at least partly) by the ITS#5434
> change. I.e it was caused by the be_search() in syncprov_playlog() not
> being propagated to the subordinate database and thereby not finding the
> entry that was modified.
>
> The bad news is that there is still a race condition that would have the
> same effect. If someone deletes the entry that was modified just before
> be_search() is done it will return the same empty cookie. I was able to
> reproduce this in a debugger by setting a break point before be_search(),
> deleting the entry with ldapdelete and in the debugger run next over the
> be_search() call before continuing.
>
> So, I still think it is best to keep delcsn[0].bv_val NULL until an
> actual cookie has been written to it, as slap_compose_sync_cookie() will
> create a cookie with an empty csn= if passed a csn array where bv_len is
> zero but bv_val is non-NULL. Although an alternative and maybe more
> correct fix could be to change slap_compose_sync_cookie() instead?
No. Thanks for your explanation of what you observed. The actual bug is in the
mods processing, when it decides that a modified entry should be treated as a
delete instead. It increments ndel then but doesn't set delcsn in that case.
That's what actually needs to be fixed.
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
[View Less]
<quote who="hyc(a)symas.com">
> Duncan Gibb wrote:
>> Howard Chu wrote:
>> HYC> Most notably, in template 1, you explicitly configured
>> HYC> syncprov on top of glue instead of taking the default
>> HYC> behavior of syncprov under glue. As documented in
>> HYC> slapd.conf(5), you should only do that when you want a
>> HYC> single syncprov overlay to be master of the entire tree.
>> HYC> In this case, you clearly want …
[View More]syncprov to only master
>> HYC> one portion of the tree, with the subordinate being
>> HYC> taken care of by its own syncprov overlay.
>>
>> That's a legacy of several previous iterations of the real project.
>> Ideally each box would have at most two backends - one holding
>> everything for which it's a consumer, and on the relevant machines a
>> second backend holding the data for which it's master, then glue, then a
>> single syncprov making the whole tree available. My colleague who
>> implemented the first iteration did it that way and it should have
>> worked, but the syncprov-glue interactions in 2.3.x were such that that
>> resulted in sections of the tree not replicating and/or disappearing in
>> various ill-defined situations.
>
> The whole-tree approach may have some snags.
> 1) it's intended to work with a single syncprov over the entire glued
> tree
> 2) it requires a unique serverID for each master
> 3) it probably won't work to start from a completely empty DB - the
> root
> entry needs to exist on each server.
It's probably worth while for us and others to start a new thread on
-software or -technical?
Gavin.
[View Less]
Duncan Gibb wrote:
> Howard Chu wrote:
> HYC> Most notably, in template 1, you explicitly configured
> HYC> syncprov on top of glue instead of taking the default
> HYC> behavior of syncprov under glue. As documented in
> HYC> slapd.conf(5), you should only do that when you want a
> HYC> single syncprov overlay to be master of the entire tree.
> HYC> In this case, you clearly want syncprov to only master
> HYC> one portion of the tree, with …
[View More]the subordinate being
> HYC> taken care of by its own syncprov overlay.
>
> That's a legacy of several previous iterations of the real project.
> Ideally each box would have at most two backends - one holding
> everything for which it's a consumer, and on the relevant machines a
> second backend holding the data for which it's master, then glue, then a
> single syncprov making the whole tree available. My colleague who
> implemented the first iteration did it that way and it should have
> worked, but the syncprov-glue interactions in 2.3.x were such that that
> resulted in sections of the tree not replicating and/or disappearing in
> various ill-defined situations.
The whole-tree approach may have some snags.
1) it's intended to work with a single syncprov over the entire glued tree
2) it requires a unique serverID for each master
3) it probably won't work to start from a completely empty DB - the root
entry needs to exist on each server.
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
[View Less]
Howard Chu wrote:
HYC> I modified your test case to make it easier to run.
HYC> The new test is in ftp.openldap.org:incoming/its5430.tgz.1
Looks good; thank you very much for this.
HYC> There were also some errors in your config file templates,
HYC> which are corrected in the above archive.
HYC> Most notably, in template 1, you explicitly configured
HYC> syncprov on top of glue instead of taking the default
HYC> behavior of syncprov under glue. As documented in
HYC…
[View More]> slapd.conf(5), you should only do that when you want a
HYC> single syncprov overlay to be master of the entire tree.
HYC> In this case, you clearly want syncprov to only master
HYC> one portion of the tree, with the subordinate being
HYC> taken care of by its own syncprov overlay.
That's a legacy of several previous iterations of the real project.
Ideally each box would have at most two backends - one holding
everything for which it's a consumer, and on the relevant machines a
second backend holding the data for which it's master, then glue, then a
single syncprov making the whole tree available. My colleague who
implemented the first iteration did it that way and it should have
worked, but the syncprov-glue interactions in 2.3.x were such that that
resulted in sections of the tree not replicating and/or disappearing in
various ill-defined situations.
Replicating each subordinate separately and gluing them together
per-server was a hack; the default overlay ordering gave us visible
entries of objectclass glue (and consequent missing subtrees) on some
consumers, so we reversed them. It's imperfect but less imperfect than
it was before ;-)
We will re-architect in the light of the recent changes to syncprov and
glue. It has already been pointed out that big-picture discussion is
noise on this list...
HYC> Not an error, but you used unique rids across all of the
HYC> config files. rids only need to be unique within a single server.
I know. That's expediency in the config generator - it just increments
a global counter each time it processes a consumer<-->provider relation.
I'll make it more elegant next time round ;-)
Cheers
Duncan
[View Less]
<quote who="hyc(a)symas.com">
> Duncan Gibb wrote:
>> hyc(a)symas.com wrote:
>> HYC> syncrepl.c is now updated in HEAD to allow this replication to
>> occur.
>>
>> HYC> Using your test, I don't see any crash.
>>
>> Thank you. We'll re-test with current HEAD and post the outcome.
>
> I modified your test case to make it easier to run. The new test is in
> ftp.openldap.org:incoming/its5430.tgz.1
Above tests pass here with …
[View More]latest HEAD (repository revision: 1.390
/repo/OpenLDAP/pkg/ldap/servers/slapd/syncrepl.c,v
>
> You simply extract it inside the tests/ subdirectory of the build tree.
> After
> you have built the OpenLDAP source and run "make test" you can cd into the
> its5430 directory and run the test.sh from there.
>
> There were also some errors in your config file templates, which are
> corrected
> in the above archive.
>
> Most notably, in template 1, you explicitly configured syncprov on top of
> glue
> instead of taking the default behavior of syncprov under glue. As
> documented
> in slapd.conf(5), you should only do that when you want a single syncprov
> overlay to be master of the entire tree. In this case, you clearly want
> syncprov to only master one portion of the tree, with the subordinate
> being
> taken care of by its own syncprov overlay. With your incorrect
> configuration,
> server3 would receive redundant updates because the subordinate branch
> would
> be propagated twice.
server3 also doesn't need syncprov module loaded, as it's a pure consumer.
Thanks,
Gavin.
[View Less]
On Wed, 19 Mar 2008, Howard Chu wrote:
>> syncprov_playlog() in syncprov.c generates such invalid cookies when
>> replaying a
>> sessionlog where no entries has been deleted. The patch also fixes this
>> bug.
>
> This should never happen, since the cookie is not generated if there were no
> deletes recorded.
Well, it do happen anyhow. Using the 2.4.8 release this can easily be
reproduced with a configuration where the syncrepl provider (with
sessionlogs …
[View More]enabled) is on a glue backend, the consumer replicates the
entire glue subtree. Assuming both servers are in sync and stopped,
start the producer, modify an entry in the subordinate db and then start
the consumer. The producer will send a cookie with an empty csn= value
as in this sync debug output:
slapd starting
slap_queue_csn: queing 0x418001a0 20080320125116.487494Z#000000#000#000000
slap_graduate_commit_csn: removing 0x8d6ae0 20080320125116.487494Z#000000#000#000000
srs csn 20080320124421.642686Z#000000#000#000000
log csn 20080320125116.487494Z#000000#000#000000
syncprov_playlog: cookie=rid=007,csn=
syncprov_search_response: cookie=rid=007,csn=20080320125116.487494Z#000000#000#000000
I have also tested it using the cvs head source, and the good news is
that this was accidentally fixed (at least partly) by the ITS#5434
change. I.e it was caused by the be_search() in syncprov_playlog() not
being propagated to the subordinate database and thereby not finding the
entry that was modified.
The bad news is that there is still a race condition that would have the
same effect. If someone deletes the entry that was modified just before
be_search() is done it will return the same empty cookie. I was able to
reproduce this in a debugger by setting a break point before be_search(),
deleting the entry with ldapdelete and in the debugger run next over the
be_search() call before continuing.
So, I still think it is best to keep delcsn[0].bv_val NULL until an
actual cookie has been written to it, as slap_compose_sync_cookie() will
create a cookie with an empty csn= if passed a csn array where bv_len is
zero but bv_val is non-NULL. Although an alternative and maybe more
correct fix could be to change slap_compose_sync_cookie() instead?
Rein
[View Less]
Duncan Gibb wrote:
> hyc(a)symas.com wrote:
> HYC> syncrepl.c is now updated in HEAD to allow this replication to occur.
>
> HYC> Using your test, I don't see any crash.
>
> Thank you. We'll re-test with current HEAD and post the outcome.
I modified your test case to make it easier to run. The new test is in
ftp.openldap.org:incoming/its5430.tgz.1
You simply extract it inside the tests/ subdirectory of the build tree. After
you have built the OpenLDAP source and …
[View More]run "make test" you can cd into the
its5430 directory and run the test.sh from there.
There were also some errors in your config file templates, which are corrected
in the above archive.
Most notably, in template 1, you explicitly configured syncprov on top of glue
instead of taking the default behavior of syncprov under glue. As documented
in slapd.conf(5), you should only do that when you want a single syncprov
overlay to be master of the entire tree. In this case, you clearly want
syncprov to only master one portion of the tree, with the subordinate being
taken care of by its own syncprov overlay. With your incorrect configuration,
server3 would receive redundant updates because the subordinate branch would
be propagated twice.
You also inserted the overlay directives in the middle of your database
clauses, which is poor practice (and caused crashes in older OpenLDAP
releases). Again, as documented, overlays are meant to be configured at the
end of their respective databases.
Not an error, but you used unique rids across all of the config files. rids
only need to be unique within a single server.
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
[View Less]
hyc(a)symas.com wrote:
DG> We've scripted a test which works for me against 2.4.8 and did work
DG> against CVS HEAD until Howard committed
Apologies for the vagueness of last night's email.
For a script whose aim is to demo a crash, "works" means "reproduces the
crash with no more intervention than invoking it and pressing enter when
prompted". I was trying to explain that in cases where replication
doesn't occur, the test can't be done, but on re-reading in daylight
perhaps I failed …
[View More]on that score ;-)
There were two commits five minutes apart. I could reproduce the crash
with syncrepl.c 1.387; replication 2-->1 did not occur with 1.389; I was
just about to test with 1.388 when I read your second post.
HYC> syncrepl.c is now updated in HEAD to allow this replication to occur.
HYC> Using your test, I don't see any crash.
Thank you. We'll re-test with current HEAD and post the outcome.
Cheers
Duncan
[View Less]