Not sure this is a bug, but I'm curious... I hit this while checking for ITS#5661. The code below is from HEAD's syncprov.c:613 (not changed recently; pardon any unintended line wrapping):
<snip>
again:
	switch( mode ) {
	case FIND_MAXCSN:
		cf.f_choice = LDAP_FILTER_GE;
		/* If there are multiple CSNs, use the one with our serverID */
		for ( i=0; i<si->si_numcsns; i++) {
			if ( slap_serverID == si->si_sids[i] ) {
				maxid = i;
				break;
			}
		}
</snip>
When run by a consumer, with no serverID set (and thus slap_serverID == 0), it causes the consumer to use the contextCSN with SID == "000" instead of the most recent one. As a consequence, if one searches the contextCSN within that consumer, slapo-syncprov's syncprov_operational() causes only that value to be returned, instead of all contextCSNs. After the consumer is restarted, all values are correctly returned. To reproduce:
- populate a DSA (SIDs in CSNs will default to "000")
- turn it into a (multi)master by adding the serverID statement (with SID > "000") and so on
- perform a modification (so that the most recent contextCSN will have SID != "000")
- create a consumer (no serverID statement, so that it defaults to "000") and let it pull data from the producer
- search the contextCSN of the consumer
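The effect of the quoted loop can be tried in isolation. What follows is only a minimal sketch, not the syncprov code itself: pick_csn_index() is a hypothetical name, SIDs are plain ints, and the fallback index of 0 assumes that maxid starts at 0 in the real function.

```c
#include <stddef.h>

/* Hypothetical stand-in for the loop quoted from syncprov_findcsn():
 * return the index of the first contextCSN whose SID equals server_id.
 * If none matches, return 0 (assumption: maxid is initialized to 0). */
int pick_csn_index(const int *sids, size_t numcsns, int server_id)
{
    int maxid = 0;

    for (size_t i = 0; i < numcsns; i++) {
        if (server_id == sids[i]) {
            maxid = (int)i;
            break;
        }
    }
    return maxid;
}
```

With SIDs {0, 1} and a defaulted serverID of 0, the sketch selects index 0, i.e. the "000" value, regardless of which contextCSN is newer.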
I'm not sure this is a bug; it might be harmless, apart from being definitely misleading. There might be multiple solutions:
- don't let syncprov_operational() muck with contextCSN that way
- make syncprov_findcsn() search the newest contextCSN instead of the one with its SID
- initialize slapd_serverID with some SID_UNDEFINED in order to take the action above only when SID is not defined
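The third option could look roughly like the sketch below. Everything in it is an assumption made for illustration: SID_UNDEFINED is a hypothetical sentinel (no such macro is quoted above), pick_own_or_newest() is a made-up name, CSNs are assumed to be strings that start with the timestamp so that strcmp() orders them by age, and falling back to the newest value when the own SID is not found is my choice rather than anything slapd does.

```c
#include <stddef.h>
#include <string.h>

#define SID_UNDEFINED (-1)   /* hypothetical sentinel, not an existing slapd macro */

/* Return the index of the contextCSN to use, or -1 if there is none.
 * - server_id defined: prefer the value carrying our own SID.
 * - server_id undefined, or own SID not present: use the newest value,
 *   assuming CSN strings sort chronologically under strcmp(). */
int pick_own_or_newest(const char *const *csns, const int *sids,
                       size_t numcsns, int server_id)
{
    size_t newest = 0;

    if (numcsns == 0)
        return -1;

    if (server_id != SID_UNDEFINED) {
        for (size_t i = 0; i < numcsns; i++) {
            if (sids[i] == server_id)
                return (int)i;
        }
    }

    for (size_t i = 1; i < numcsns; i++) {
        if (strcmp(csns[i], csns[newest]) > 0)
            newest = i;
    }
    return (int)newest;
}
```

A consumer without a serverID statement would then pass SID_UNDEFINED and get the most recent contextCSN instead of the "000" one.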
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
Pierangelo Masarati wrote:
Not sure this is a bug, but I'm curious...
Apparently, the issue is a bit different. The wrong contextCSN comes from somewhere else. When the consumer starts empty, syncprov_db_open() does not find the context entry. Subsequently, after the refresh phase, the wrong (and only) contextCSN is taken from the pending CSN list in slap_get_commit_csn(). It seems to be syncrepl_updateCookie() that queues the wrong contextCSN via slap_queue_csn(). Fixing...
p.
Pierangelo Masarati wrote:
Not sure this is a bug, but I'm curious... I hit this while checking for ITS#5661. The code below is from HEAD's syncprov.c:613 (not changed recently; pardon any unintended line wrapping):
[code and discussion removed]
- make syncprov_findcsn() search the newest contextCSN instead of the one with its SID
Only looking for contextCSN values with sid matching the serverID was introduced in revision 1.240 to fix ITS#5537.
- initialize slapd_serverID with some SID_UNDEFINED in order to take the action above only when SID is not defined
I agree, although I would prefer to take it one step further and reserve serverID==0 for the tools case. In ITS#5536 I tried to distinguish between a defaulted and configured serverID==0, but it didn't quite slip through and was closed without being properly fixed. It should probably be reopened.
Btw, the ITS#5675 fix to syncrepl.c improves the contextCSN propagation from syncrepl to syncprov, but the csn queue isn't really suitable for that. Syncrepl may update more than one contextCSN value at the same time, but the queue can only pass one around. I'm currently testing a patch that fixes the contextCSN propagation problems we have seen; it should fix this as well.
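To illustrate why a single queued value is not enough, here is a rough sketch of merging a whole set of received contextCSN values, one per SID, into a local set. It is not the patch mentioned above. The struct, the function names and the fixed sizes are all hypothetical, and CSNs are again assumed to be strings that strcmp() orders chronologically.

```c
#include <stddef.h>
#include <string.h>

#define CSN_MAX  64   /* hypothetical buffer size for one CSN string */
#define MAX_SIDS 8    /* hypothetical upper bound on tracked SIDs */

struct ctx_csns {
    int    sids[MAX_SIDS];
    char   csns[MAX_SIDS][CSN_MAX];
    size_t num;
};

/* Merge one value: 1 if the set changed, 0 if the value was not newer,
 * -1 if the value is too long or the set is full. */
int ctx_merge_one(struct ctx_csns *ctx, int sid, const char *csn)
{
    size_t len = strlen(csn);

    if (len >= CSN_MAX)
        return -1;

    for (size_t i = 0; i < ctx->num; i++) {
        if (ctx->sids[i] == sid) {
            if (strcmp(csn, ctx->csns[i]) <= 0)
                return 0;               /* older or equal: keep ours */
            memcpy(ctx->csns[i], csn, len + 1);
            return 1;
        }
    }

    if (ctx->num == MAX_SIDS)
        return -1;
    ctx->sids[ctx->num] = sid;
    memcpy(ctx->csns[ctx->num], csn, len + 1);
    ctx->num++;
    return 1;
}

/* Merge n values at once; returns how many slots changed, or -1 on error. */
int ctx_merge(struct ctx_csns *ctx, const int *sids,
              const char *const *csns, size_t n)
{
    int changed = 0;

    for (size_t i = 0; i < n; i++) {
        int rc = ctx_merge_one(ctx, sids[i], csns[i]);
        if (rc < 0)
            return -1;
        changed += rc;
    }
    return changed;
}
```

The point of the sketch is only that each SID keeps its own maximum, so several values can advance in one update.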
Rein
Rein Tollevik wrote:
Pierangelo Masarati wrote:
Not sure this is a bug, but I'm curious... I hit this while checking for ITS#5661. The code below is from HEAD's syncprov.c:613 (not changed recently; pardon any unintended line wrapping):
[code and discussion removed]
- make syncprov_findcsn() search the newest contextCSN instead of the one with its SID
Only looking for contextCSN values with sid matching the serverID was introduced in revision 1.240 to fix ITS#5537.
- initialize slapd_serverID with some SID_UNDEFINED in order to take the action above only when SID is not defined
I agree, although I would prefer to take it one step further and reserve serverID==0 for the tools case. In ITS#5536 I tried to distinguish between a defaulted and configured serverID==0, but it didn't quite slip through and was closed without being properly fixed. It should probably be reopened.
Well, a serverID of 0 is basically the same as no serverID. For mirrormode/multimaster the serverID must be non-zero. For single-master the serverID must be zero.
By the way, re: the current test050 failures in RE24, I saw the failure occur again even with the latest syncrepl.c patch reverted, so it appears that was just coincidental the last time.
Howard Chu wrote:
Rein Tollevik wrote:
Pierangelo Masarati wrote:
- initialize slapd_serverID with some SID_UNDEFINED in order to take the action above only when SID is not defined
I agree, although I would prefer to take it one step further and reserve serverID==0 for the tools case. In ITS#5536 I tried to distinguish between a defaulted and configured serverID==0, but it didn't quite slip through and was closed without being properly fixed. It should probably be reopened.
Well, a serverID of 0 is basically the same as no serverID. For mirrormode/multimaster the serverID must be non-zero. For single-master the serverID must be zero.
This is not how I read the doc nor the source. But if it was like this then it should be what I need :-) To enforce it syncprov must be changed so that:
If serverID is 0 it should only allow one contextCSN value, and it should have 0 in the sid field. Maybe not required to enforce, but it should help to quickly identify incorrectly configured servers.
If serverID is not 0 it should not accept contextCSN values from syncrepl with 0 in the sid field, to make sure it doesn't receive updates from a single-master configured server.
If serverID is not 0 it must ignore contextCSN values with 0 in the sid field read from the database. This is to allow a single-master server to be promoted to a multi-master without leaving the old sid=0 csn around forever. Hmm, if a csn with sid=0 is found, but none with the serverID value, then it could maybe be better to replace the sid in that csn? More hmm, when starting up it would probably be correct to include entries with 0 in the sid fields of their entryCSN value in those that could cause the current server's contextCSN to be updated? I expect I'm not the only one who forgets to add the -S argument to slapadd...
The serverID in existing mirrormode/multimaster configurations that use 0 as the value must be changed, but this should be all that is needed when upgrading to this version.
What would be the correct action if a contextCSN with an invalid sid value is received from syncrepl? Asserting it could be a bit too strict, better to ignore the value and complain loudly in the logs?
Does this make any sense? If so, I'll volunteer to implement.
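The three rules above can be written down as one small decision function. This is only a sketch of the proposal as stated, not existing syncprov behaviour. The enum and function names are hypothetical, and "reject" here means "ignore the value and complain in the logs", following the question about invalid sid values.

```c
/* Where a contextCSN value came from (hypothetical names). */
enum csn_source  { CSN_FROM_SYNCREPL, CSN_FROM_DATABASE };

/* What to do with it (hypothetical names).
 * CSN_REJECT: ignore the value and log a complaint.
 * CSN_IGNORE: silently skip a stale sid=0 value left in the database. */
enum csn_verdict { CSN_ACCEPT, CSN_IGNORE, CSN_REJECT };

enum csn_verdict check_csn_sid(int server_id, int csn_sid,
                               enum csn_source src)
{
    if (server_id == 0) {
        /* Rule 1: a serverID of 0 only expects sid 0
         * (the proposal notes enforcement may be optional). */
        return csn_sid == 0 ? CSN_ACCEPT : CSN_REJECT;
    }

    if (csn_sid == 0) {
        /* Rule 2: no sid=0 updates from syncrepl.
         * Rule 3: sid=0 values already in the database are ignored. */
        return src == CSN_FROM_SYNCREPL ? CSN_REJECT : CSN_IGNORE;
    }

    return CSN_ACCEPT;
}
```

For example, a server with serverID 1 that reads an old sid=0 contextCSN from its database would get CSN_IGNORE, which covers the single-master to multi-master promotion case.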
Rein
Rein Tollevik wrote:
Well, a serverID of 0 is basically the same as no serverID. For mirrormode/multimaster the serverID must be non-zero. For single-master the serverID must be zero.
This is not how I read the doc nor the source. But if it was like this then it should be what I need :-) To enforce it syncprov must be changed so that:
If serverID is 0 it should only allow one contextCSN value, and it should have 0 in the sid field. Maybe not required to enforce, but it should help to quickly identify incorrectly configured servers.
If serverID is not 0 it should not accept contextCSN values from syncrepl with 0 in the sid field, to make sure it doesn't receive updates from a single-master configured server.
If serverID is not 0 it must ignore contextCSN values with 0 in the sid field read from the database. This is to allow a single-master server to be promoted to a multi-master without leaving the old sid=0 csn around forever. Hmm, if a csn with sid=0 is found, but none with the serverID value, then it could maybe be better to replace the sid in that csn? More hmm, when starting up it would probably be correct to include entries with 0 in the sid fields of their entryCSN value in those that could cause the current server's contextCSN to be updated? I expect I'm not the only one who forgets to add the -S argument to slapadd...
The serverID in existing mirrormode/multimaster configurations that use 0 as the value must be changed, but this should be all that is needed when upgrading to this version.
What would be the correct action if a contextCSN with an invalid sid value is received from syncrepl? Asserting it could be a bit too strict, better to ignore the value and complain loudly in the logs?
Does this make any sense? If so, I'll volunteer to implement.
To me, it makes a lot of sense and, well explained in the docs, would greatly help troubleshooting (or even better, set up things the right way right away).
My concerns are:
- do we need to consider all those cases and try to repair them? I'd say: no. Just complain (and refuse to start) if the problem can be solved by running "slapadd -S <SID>" or "slapcat | sed | slapadd".
- the problem should not occur at run time in a homogeneous, well-configured system (== same versions, consistent configuration). If it happens, just give up replication and/or commence a full refresh (agreed that assert'ing would be bad).
- slapadd could detect from the configuration whether -S is needed (I don't think it could determine the right SID, but at least it could complain, and require a --force (to be implemented) if one believes he knows what he's doing).
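The last point could be as simple as the check sketched below. None of this exists: the function, the enum and the --force flag are hypothetical, and the check only looks at the configuration, so it cannot tell whether the LDIF already carries entryCSN values.

```c
/* Outcome of the hypothetical pre-import check. */
enum sid_check { SID_OK, SID_WARN_FORCED, SID_REFUSE };

/* configured_server_id: serverID from the configuration (0 = none/default)
 * sid_option_given:     non-zero if -S was passed on the command line
 * force:                non-zero if the (not yet existing) --force was passed */
enum sid_check check_slapadd_sid(int configured_server_id,
                                 int sid_option_given, int force)
{
    /* No serverID configured, or -S supplied: nothing to complain about. */
    if (configured_server_id == 0 || sid_option_given)
        return SID_OK;

    /* serverID configured but -S missing: complain, and only go on
     * if the administrator insists. */
    return force ? SID_WARN_FORCED : SID_REFUSE;
}
```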
p.
Pierangelo Masarati wrote:
- do we need to consider all those cases and try to repair them? I'd say: no. Just complain (and refuse to start) if the problem can be solved by running "slapadd -S <SID>" or "slapcat | sed | slapadd".
- the problem should not occur at run time in a homogeneous, well-configured system (== same versions, consistent configuration). If it happens, just give up replication and/or commence a full refresh (agreed that assert'ing would be bad).
- slapadd could detect from the configuration whether -S is needed (I don't think it could determine the right SID, but at least it could complain, and require a --force (to be implemented) if one believes he knows what he's doing).
Can I confirm the use case here? I've not used the -S option and it sounds very important. According to Ando it should be clearly documented too.
Is it used in a MM/N-Way when exporting via slapcat and then importing to another server that will have its own serverID, hence the -S to override the currently exported serverID from the first server?
Thanks.
Gavin Henry wrote:
Can I confirm the use case here? I've not used the -S option and it sounds very important. According to Ando it should be clearly documented too.
Is it used in a MM/N-Way when exporting via slapcat and then importing to another server that will have its own serverID, hence the -S to override the currently exported serverID from the first server?
As far as I know, you don't need it unless you're initializing a MM/N-Way from a clean LDIF (i.e. without entryCSN). Usually, when you restore from a backup, you want existing entryCSN to be preserved. -S only affects the SID portion of entryCSN *generated* by slapadd, i.e. those that were missing in the source LDIF. I added that option some time ago, when I needed to generate a database for a N-Way by importing an LDIF obtained from SunONE. The procedure then was:
- slapadd -w -S 001 -l plain.ldif
- slapcat -l full.ldif
- scp full.ldif other-n-way:
on other-n-way:
- slapadd -l full.ldif
This way, all N-Way would get the same database with the SID of the first one, "001". As an alternative, I could have fired up the first one and let the others sync.
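The "only generated entryCSN" behaviour described above can be pictured with the sketch below. It is not slapadd's code. The function name is made up, and the CSN layout (timestamp, then '#'-separated hex fields with a three-digit SID in third position) is my assumption about the format.

```c
#include <stdio.h>
#include <stddef.h>

/* Return the entryCSN to store for an imported entry.
 * existing:  entryCSN found in the LDIF, or NULL if the entry had none
 * timestamp: e.g. "20080902120000.000000Z" (assumed layout)
 * sid:       the value given with -S
 * An existing entryCSN is preserved untouched; only a missing one is
 * generated, with sid in the SID field. Returns NULL if buf is too small. */
const char *entry_csn_for_import(const char *existing, const char *timestamp,
                                 unsigned sid, char *buf, size_t buflen)
{
    if (existing != NULL)
        return existing;

    int n = snprintf(buf, buflen, "%s#%06x#%03x#%06x",
                     timestamp, 0u, sid, 0u);
    if (n < 0 || (size_t)n >= buflen)
        return NULL;
    return buf;
}
```

This is why -S matters for a clean LDIF such as one exported from another product, and does nothing for a slapcat backup that already has entryCSN on every entry.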
p.
----- "Pierangelo Masarati" ando@sys-net.it wrote:
Gavin Henry wrote:
Can I confirm the use case here? I've not used the -S option and it sounds very important. According to Ando it should be clearly documented too.
Is it used in a MM/N-Way when exporting via slapcat and then importing to another server that will have its own serverID, hence the -S to override the currently exported serverID from the first server?
As far as I know, you don't need it unless you're initializing a MM/N-Way from a clean LDIF (i.e. without entryCSN). Usually, when you restore from a backup, you want existing entryCSN to be preserved. -S only affects the SID portion of entryCSN *generated* by slapadd, i.e. those that were missing in the source LDIF. I added that option some time ago, when I needed to generate a database for a N-Way by importing an LDIF obtained from SunONE. The procedure then was:
- slapadd -w -S 001 -l plain.ldif
- slapcat -l full.ldif
- scp full.ldif other-n-way:
on other-n-way:
- slapadd -l full.ldif
OK, that's perfectly clear.
This way, all N-Way would get the same database with the SID of the first one, "001". As an alternative, I could have fired up the first one and let the others sync.
Yes, the latter way is what most folks do, except those who have massive data sets and can't/won't sync online.
Thanks for clearing that up.
Gavin.