We're experiencing some strange issues with mirrormode; I don't have the whole picture at hand, right now, and I haven't been able to create the issue in a repeatable manner, however it happens regularly. The scenario is: 2 mirror-mode servers, each with 2 databases, plus back-config, replicated using syncrepl refreshAndPersist. We notice occasional sigsegv, and we could finally get a core dump, which contains
#0 0x080e0e6f in compare_csns (sc1=0x33a96a80, sc2=0x33a96aa0, which=0x33a9682c) at ../../../servers/slapd/syncrepl.c:665 665 for (j=0; !BER_BVISNULL( &sc2->ctxcsn[j] ); j++) { (gdb) bt full #0 0x080e0e6f in compare_csns (sc1=0x33a96a80, sc2=0x33a96aa0, which=0x33a9682c) at ../../../servers/slapd/syncrepl.c:665 i = 0 j = 0 match = 0 text = 0x80e0e1a "\201� #1 0x080e7385 in do_syncrep2 (op=0x33a96d40, si=0x919c548) at ../../../servers/slapd/syncrepl.c:990 i = Variable "i" is not available.
(gdb) p *sc1 $2 = {ctxcsn = 0x9981310, octet_str = {bv_len = 60, bv_val = 0x9d2b680 "rid=004,sid=000,csn=20070925132254.897919Z#000000#000#000000"}, rid = 4, sid = 0, numcsns = 1, sids = 0x9ce3120, sc_next = {stqe_next = 0x0}} (gdb) p *sc2 $3 = {ctxcsn = 0x0, octet_str = {bv_len = 20, bv_val = 0x9c299f0 "rid=004,sid=000,csn="}, rid = 4, sid = 0, numcsns = 0, sids = 0x0, sc_next = {stqe_next = 0x0}}
p sc1->ctxcsn[0] $4 = {bv_len = 40, bv_val = 0x9e98cf0 "20070925132254.897919Z#000000#000#000000"} (gdb) p sc1->ctxcsn[1] $5 = {bv_len = 0, bv_val = 0x0} (gdb) p sc1->sids[0] $6 = 0 (gdb) p sc1->sids[1] $7 = 159661424 (gdb) p sc2->ctxcsn[0] $8 = {bv_len = 0, bv_val = 0x0} (gdb) p sc2->sids[0] $9 = 0
Line numbers may be slightly off because of minimal customization which should not impact this functionality.
Mi first question is: is "rid=004,sid=000,csn=" a legitimate cookie? If it is, the code must account for (not) calling compare_csns() when the ctxcsn member of either struct sync_cookie is NULL. In that case, I suggest the patch
Index: servers/slapd/syncrepl.c =================================================================== RCS file: /repo/OpenLDAP/pkg/ldap/servers/slapd/syncrepl.c,v retrieving revision 1.361 diff -u -r1.361 syncrepl.c --- servers/slapd/syncrepl.c 29 Sep 2007 14:11:28 -0000 1.361 +++ servers/slapd/syncrepl.c 1 Oct 2007 12:25:05 -0000 @@ -985,7 +985,13 @@ if ( !BER_BVISNULL( &syncCookie.octet_str ) ) { slap_parse_sync_cookie( &syncCookie, NULL ); - compare_csns( &syncCookie_req, &syncCookie, &m ); + m = 0; + if ( syncCookie.ctxcsn ) { + compare_csns( &syncCookie_req, &syncCookie, &m ); + } else { + /* otherwise it would be dereferenced few lines below */ + assert( !refreshDeletes ); + } } } if ( ber_peek_tag( ber, &len ) ==
So the second question, assuming the above cookie is legitimate, would be: is it legitimate that the above cookie will only occur with refreshDelete unset?
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
Pierangelo Masarati wrote:
Mi first question is: is "rid=004,sid=000,csn=" a legitimate cookie?
As far as our implementation goes, no. slap_compose_sync_cookie() should never produce this.
I note that slap_compose_sync_cookie() is still generating the rid with %03d, but since you've changed the rid to be valid up to 4095, it should %03x.
Pierangelo Masarati wrote:
Mi first question is: is "rid=004,sid=000,csn=" a legitimate cookie?
As far as our implementation goes, no. slap_compose_sync_cookie() should never produce this.
I could reproduce it, but unfortunately I don't have much info available. I have instrumented syncprov.c to log the cookie any time it is prepared, and the incorrect cookie was generated by slap_compose_sync_cookie() when invoked by syncprov_sendresp() after a successful add. I only saw the log, so I couldn't inspect memory during that operation, so I'm trying to narrow things down to a small and quickly reproducible example. Apparently, opc->sctxcsn contains an empty berval, otherwise no "csn=" portion would apprear in the cookie.
I note that slap_compose_sync_cookie() is still generating the rid with %03d, but since you've changed the rid to be valid up to 4095, it should %03x.
Fixed, thanks.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
Pierangelo Masarati wrote:
Mi first question is: is "rid=004,sid=000,csn=" a legitimate cookie?
As far as our implementation goes, no. slap_compose_sync_cookie() should never produce this.
I could reproduce it, but unfortunately I don't have much info available. I have instrumented syncprov.c to log the cookie any time it is prepared, and the incorrect cookie was generated by slap_compose_sync_cookie() when invoked by syncprov_sendresp() after a successful add. I only saw the log, so I couldn't inspect memory during that operation, so I'm trying to narrow things down to a small and quickly reproducible example. Apparently, opc->sctxcsn contains an empty berval, otherwise no "csn=" portion would apprear in the cookie.
I note that in syncprov_op_response(), if slap_get_commit_csn() can't get a maxcsn, opc->sctxcsn would be set to an empty berval, and this would trigger the above cookie. I wonder if this is possible at all, and if setting opc->sctxcsn to a NULL berval would just cure the problem.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
Pierangelo Masarati wrote:
Pierangelo Masarati wrote:
Mi first question is: is "rid=004,sid=000,csn=" a legitimate cookie?
As far as our implementation goes, no. slap_compose_sync_cookie() should never produce this.
I could reproduce it, but unfortunately I don't have much info available. I have instrumented syncprov.c to log the cookie any time it is prepared, and the incorrect cookie was generated by slap_compose_sync_cookie() when invoked by syncprov_sendresp() after a successful add. I only saw the log, so I couldn't inspect memory during that operation, so I'm trying to narrow things down to a small and quickly reproducible example. Apparently, opc->sctxcsn contains an empty berval, otherwise no "csn=" portion would apprear in the cookie.
I note that in syncprov_op_response(), if slap_get_commit_csn() can't get a maxcsn, opc->sctxcsn would be set to an empty berval, and this would trigger the above cookie. I wonder if this is possible at all, and if setting opc->sctxcsn to a NULL berval would just cure the problem.
I think that will just mask the problem. If you're in an Add operation, there must be a CSN already. So perhaps we should assert() in slap_get_commit_csn and find out why it returned without one.
Pierangelo Masarati wrote:
I note that in syncprov_op_response(), if slap_get_commit_csn() can't get a maxcsn, opc->sctxcsn would be set to an empty berval, and this would trigger the above cookie. I wonder if this is possible at all, and if setting opc->sctxcsn to a NULL berval would just cure the problem.
I think that will just mask the problem. If you're in an Add operation, there must be a CSN already. So perhaps we should assert() in slap_get_commit_csn and find out why it returned without one.
I'm not saying that's where the error occurred, since I couldn't track what happened. I'm just looking for places where it could have occurred. Note that the add occurred on server 0, while the broken cookie was cooked on server 1 when the two were playing mirrormode. So the one acting as consumer didn't commit anything, and my guess is that for this reason it hadn't any maxcsn available.
I note that test050 tests mirrormode in refreshOnly; is mirrormode supposed to work in refreshAndPersist as well? The test seems to work as expected...
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
Pierangelo Masarati wrote:
I note that test050 tests mirrormode in refreshOnly
OK, forget about this: the test is a mix of refreshOnly (the actual database) and refreshAndPersist (the config database).
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
I've backed out RID, which is now decimal again (0..999), and let sctxcsn be NULL when no maxcsn can be gathered. In this case, after running test050 and leaving the directories in place, if I restart the two servers and perform an add as the first operation, the server acting as a consumer crashes both in refreshOnly and in refreshAndPersist modes (I plan to modify the test in order to trigger this case, so we can track regressions once it's fixed). My question is now:
1) is it correct that syncprov_op_response() is set up at all for those modifications that come from the producer?
2) If it is, should maxcsn be available?
3) If not, how should its absence be dealt with?
4) otherwise, if yes, why isn't it available?
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
Pierangelo Masarati wrote:
I've backed out RID, which is now decimal again (0..999), and let sctxcsn be NULL when no maxcsn can be gathered. In this case, after running test050 and leaving the directories in place, if I restart the two servers and perform an add as the first operation, the server acting as a consumer crashes both in refreshOnly and in refreshAndPersist modes (I plan to modify the test in order to trigger this case, so we can track regressions once it's fixed). My question is now:
- is it correct that syncprov_op_response() is set up at all for those
modifications that come from the producer?
Obviously yes, to enable cascaded replication
- If it is, should maxcsn be available?
don't know yet
- If not, how should its absence be dealt with?
it should definitely be dealt with: cascaded replication incurs into this issue; after adding cookie logging, testrun/slapd.4.log after test019 is full of
syncprov_sendresp: cookie=rid=001,csn=
- otherwise, if yes, why isn't it available?
dunno yet.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
Pierangelo Masarati wrote:
Pierangelo Masarati wrote:
I've backed out RID, which is now decimal again (0..999), and let sctxcsn be NULL when no maxcsn can be gathered. In this case, after running test050 and leaving the directories in place, if I restart the two servers and perform an add as the first operation, the server acting as a consumer crashes both in refreshOnly and in refreshAndPersist modes (I plan to modify the test in order to trigger this case, so we can track regressions once it's fixed). My question is now:
- is it correct that syncprov_op_response() is set up at all for those
modifications that come from the producer?
Obviously yes, to enable cascaded replication
- If it is, should maxcsn be available?
don't know yet
- If not, how should its absence be dealt with?
it should definitely be dealt with: cascaded replication incurs into this issue; after adding cookie logging, testrun/slapd.4.log after test019 is full of
syncprov_sendresp: cookie=rid=001,csn=
- otherwise, if yes, why isn't it available?
dunno yet.
The syncrepl_entry() function doesn't call slap_queue_csn()/slap_graduate_commit_csn(), therefore the CSN is not propagated. The reason for that is because during a refresh phase, updates are not received in CSN order. We only propagate the CSN when we receive a new cookie from the provider, and the provider will only send it when the refresh phase is completed. This is to prevent the local contextCSN from advancing past its actual content...
Once we transition into persist phase, the provider should be sending a cookie with every update, and we should be propagating it. It looks like perhaps syncrepl_entry() is broken here in that it doesn't know whether or not a cookie was received. I'll look into that.
Pierangelo Masarati wrote:
Pierangelo Masarati wrote:
I've backed out RID, which is now decimal again (0..999), and let sctxcsn be NULL when no maxcsn can be gathered. In this case, after running test050 and leaving the directories in place, if I restart the two servers and perform an add as the first operation, the server acting as a consumer crashes both in refreshOnly and in refreshAndPersist modes (I plan to modify the test in order to trigger this case, so we can track regressions once it's fixed). My question is now:
- is it correct that syncprov_op_response() is set up at all for those
modifications that come from the producer?
Obviously yes, to enable cascaded replication
- If it is, should maxcsn be available?
don't know yet
- If not, how should its absence be dealt with?
it should definitely be dealt with: cascaded replication incurs into this issue; after adding cookie logging, testrun/slapd.4.log after test019 is full of
syncprov_sendresp: cookie=rid=001,csn=
This is probably an artifact of syncprov_qresp() which always allocates at least one byte for the queued CSN, thus the bv_val is never NULL. Which is why the test in compose_sync_cookie() was supposed to check BVISEMPTY, not BVISNULL.
Pierangelo Masarati wrote:
This is probably an artifact of syncprov_qresp() which always allocates at least one byte for the queued CSN, thus the bv_val is never NULL. Which is why the test in compose_sync_cookie() was supposed to check BVISEMPTY, not BVISNULL.
I still strongly suspect that the BVISEMPTY test was just hiding a can of worms. I was running a sequence of add/delete on server 1; I'll call producer the server that receives the sequence of add/delete, and consumer the one that is replicated; however the trouble is caused when they act the opposite: it's syncprov on the consumer that sends bogus stuff to syncrepl on the producer. With yesterday's fixes, and after making sure each server was recognizing itself with the right SID (1 and 2), I caught the consumer sending back a bad cookie when playing the log. In any case, there appears to be an erratic behavior when performing deletes. What I infer from the logs is that the consumer sends back modifications to the producer with an incorrect cookie (sid=000), and either does not perform deletes, or does not perform them in time, and this causes trouble. I attach sanitized logs of both producer and consumer, in case they help, and I'll keep investigating.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
Pierangelo Masarati wrote:
Pierangelo Masarati wrote:
Mi first question is: is "rid=004,sid=000,csn=" a legitimate cookie?
As far as our implementation goes, no. slap_compose_sync_cookie() should never produce this.
I could reproduce it, but unfortunately I don't have much info available. I have instrumented syncprov.c to log the cookie any time it is prepared, and the incorrect cookie was generated by slap_compose_sync_cookie() when invoked by syncprov_sendresp() after a successful add. I only saw the log, so I couldn't inspect memory during that operation, so I'm trying to narrow things down to a small and quickly reproducible example. Apparently, opc->sctxcsn contains an empty berval, otherwise no "csn=" portion would apprear in the cookie.
This sounds like we're being bitten by the conversion of the original BER_BVISEMPTY checks to BER_BVISNULL. As I noted at that time, this was probably going to break.
Pierangelo Masarati wrote:
Pierangelo Masarati wrote: I note that slap_compose_sync_cookie() is still generating the rid with %03d, but since you've changed the rid to be valid up to 4095, it should %03x.
Fixed, thanks.
Of course, now slap_parse_sync_cookie() also needs to be changed accordingly.
Tell me again why we changed the range of RIDs? They weren't broken before, why muck with them?
Howard Chu wrote:
Pierangelo Masarati wrote:
Pierangelo Masarati wrote: I note that slap_compose_sync_cookie() is still generating the rid with %03d, but since you've changed the rid to be valid up to 4095, it should %03x.
Fixed, thanks.
Of course, now slap_parse_sync_cookie() also needs to be changed accordingly.
Bah. Re-reading the CVS logs, the RID parse/generation was intentionally left in decimal before. Which makes sense, since that preserves compatibility with 2.3. I was looking for that explanation before in my email logs but missed it until now.
Tell me again why we changed the range of RIDs? They weren't broken before, why muck with them?
Howard Chu wrote:
Bah. Re-reading the CVS logs, the RID parse/generation was intentionally left in decimal before. Which makes sense, since that preserves compatibility with 2.3. I was looking for that explanation before in my email logs but missed it until now.
Tell me again why we changed the range of RIDs? They weren't broken before, why muck with them?
... and all of this was triggered by moving the SID to hex. We can still back the RID out, since it has never been announced nor documented intentionally.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------
Pierangelo Masarati wrote:
Pierangelo Masarati wrote: I note that slap_compose_sync_cookie() is still generating the rid with %03d, but since you've changed the rid to be valid up to 4095, it should %03x.
Fixed, thanks.
Of course, now slap_parse_sync_cookie() also needs to be changed accordingly.
Tell me again why we changed the range of RIDs? They weren't broken before, why muck with them?
Because, by analogy with SIDs. I know they're serve totally different purposes, but having two things that are so similar in name and syntax, are used most of the time one along each other, but use different base and range looked odd to me.
But this has nothing to do with the issue of this thread, I believe.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------