Issue in syncprov findcsn code


Pierangelo Masarati

2 Sep 2008

10:22 a.m.

Not sure this is a bug, but I'm curious... I hit this while checking for ITS#5661. The code below is from HEAD's syncprov.c:613 (not changed recently; pardon any unintended line wrapping):

<snip>
again:
	switch( mode ) {
	case FIND_MAXCSN:
		cf.f_choice = LDAP_FILTER_GE;
		/* If there are multiple CSNs, use the one with our serverID */
		for ( i=0; i<si->si_numcsns; i++) {
			if ( slap_serverID == si->si_sids[i] ) {
				maxid = i;
				break;
			}
		}
</snip>

When run by a consumer, with no serverID set (and thus slap_serverID == 0), it causes the consumer to use the contextCSN with SID == "000" instead of the most recent one. As a consequence, if one searches the contextCSN within that consumer, slapo-syncprov's syncprov_operational() causes only that value to be returned, instead of all contextCSNs. After the consumer is restarted, all values are correctly returned. To reproduce:

- populate a DSA (SIDs in CSNs will default to "000")

- turn it into a (multi)master by adding the serverID statement (with SID > "000") and so on

- perform a modification (so that the most recent contextCSN will have SID != "000")

- create a consumer (no serverID statement, so that it defaults to "000") and let it pull data from the producer

- search the contextCSN of the consumer

I'm not sure this is a bug; it might be harmless, apart from being definitely misleading. There might be multiple solutions:

- don't let syncprov_operational() muck with contextCSN that way

- make syncprov_findcsn() search the newest contextCSN instead of the one with its SID

- initialize slap_serverID with some SID_UNDEFINED in order to take the action above only when SID is not defined
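
As a rough illustration of the second option, the sketch below contrasts the selection done by the quoted loop with a "pick the newest value" alternative. The struct and both function names are hypothetical stand-ins, not the actual syncprov data structures, and the -1 return for "no match" is an assumption of this sketch rather than what syncprov.c does.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the per-database state; the field names
 * echo the quoted snippet, but this is not the real syncprov struct. */
struct csn_state {
	int          numcsns;   /* number of contextCSN values */
	const int   *sids;      /* SID of each value */
	const char **csns;      /* CSN strings, in the same order as sids */
};

/* Selection as in the quoted loop: the value whose SID equals ours. */
int pick_by_sid( const struct csn_state *st, int server_id )
{
	int i;

	assert( st != NULL );
	for ( i = 0; i < st->numcsns; i++ ) {
		if ( st->sids[i] == server_id )
			return i;
	}
	return -1;	/* assumption of this sketch: report "no match" */
}

/* Alternative: the newest value. CSN strings are fixed-width and
 * start with the timestamp, so strcmp() orders them by time. */
int pick_newest( const struct csn_state *st )
{
	int i, maxid = -1;

	assert( st != NULL );
	for ( i = 0; i < st->numcsns; i++ ) {
		if ( maxid < 0 || strcmp( st->csns[i], st->csns[maxid] ) > 0 )
			maxid = i;
	}
	return maxid;
}
```

With one old value carrying SID "000" and a newer one carrying SID "001", a consumer whose serverID defaults to 0 would select the old value through pick_by_sid() and the newer one through pick_newest().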

Ing. Pierangelo Masarati OpenLDAP Core Team

SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
-----------------------------------
Office: +39 02 23998309
Mobile: +39 333 4963172
Fax:    +39 0382 476497
Email:  ando@sys-net.it
-----------------------------------


Pierangelo Masarati

2 Sep

11:30 a.m.

Pierangelo Masarati wrote:

...

Not sure this is a bug, but I'm curious...

Apparently, the issue is a bit different. The wrong contextCSN comes from somewhere else. When the consumer starts empty, syncprov_db_open() does not find the context entry. Subsequently, after the refresh phase, the wrong (and only) contextCSN is taken from the pending CSN list in slap_get_commit_csn(). It seems to be syncrepl_updateCookie() that passes the wrong contextCSN to slap_queue_csn(). Fixing...

Ing. Pierangelo Masarati OpenLDAP Core Team

Rein Tollevik

16 Sep

1:02 p.m.

Pierangelo Masarati wrote:

...

Not sure this is a bug, but I'm curious... I hit this while checking for ITS#5661. The code below is from HEAD's syncprov.c:613 (not changed recently; pardon any unintended line wrapping):

[code and discussion removed]

...

make syncprov_findcsn() search the newest contextCSN instead of the one with its SID

Only looking for contextCSN values with sid matching the serverID was introduced in revision 1.240 to fix ITS#5537.

...

initialize slap_serverID with some SID_UNDEFINED in order to take the action above only when SID is not defined

I agree, although I would prefer to take it one step further and reserve serverID==0 for the tools case. In ITS#5536 I tried to distinguish between a defaulted and configured serverID==0, but it didn't quite slip through and was closed without being properly fixed. It should probably be reopened.

Btw, the ITS#5675 fix to syncrepl.c improves the contextCSN propagation from syncrepl to syncprov, but the csn queue isn't really suitable for that. Syncrepl may update more than one contextCSN value at the same time, but the queue can only pass one around. I'm currently testing a patch that fixes the contextCSN propagation problems we have seen; it should fix this as well.

Rein

Howard Chu

7:07 p.m.

Rein Tollevik wrote:

...

Pierangelo Masarati wrote:

...
Not sure this is a bug, but I'm curious... I hit this while checking for ITS#5661. The code below is from HEAD's syncprov.c:613 (not changed recently; pardon any unintended line wrapping):

[code and discussion removed]

...

make syncprov_findcsn() search the newest contextCSN instead of the one with its SID

Only looking for contextCSN values with sid matching the serverID was introduced in revision 1.240 to fix ITS#5537.

...

initialize slap_serverID with some SID_UNDEFINED in order to take the action above only when SID is not defined

I agree, although I would prefer to take it one step further and reserve serverID==0 for the tools case. In ITS#5536 I tried to distinguish between a defaulted and configured serverID==0, but it didn't quite slip through and was closed without being properly fixed. It should probably be reopened.

Well, a serverID of 0 is basically the same as no serverID. For mirrormode/multimaster the serverID must be non-zero. For single-master the serverID must be zero.

By the way, re: the current test050 failures in RE24, I saw the failure occur again even with the latest syncrepl.c patch reverted, so it appears that was just coincidental the last time.

...

Btw, the ITS#5675 fix to syncrepl.c improves the contextCSN propagation from syncrepl to syncprov, but the csn queue isn't really suitable for that. Syncrepl may update more than one contextCSN value at the same time, but the queue can only pass one around. I'm currently testing a patch that fixes the contextCSN propagation problems we have seen, it should fix this as well.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

Rein Tollevik

17 Sep

12:29 p.m.

Howard Chu wrote:

...

Rein Tollevik wrote:

...
Pierangelo Masarati wrote:

...

initialize slap_serverID with some SID_UNDEFINED in order to take the action above only when SID is not defined

I agree, although I would prefer to take it one step further and reserve serverID==0 for the tools case. In ITS#5536 I tried to distinguish between a defaulted and configured serverID==0, but it didn't quite slip through and was closed without being properly fixed. It should probably be reopened.

Well, a serverID of 0 is basically the same as no serverID. For mirrormode/multimaster the serverID must be non-zero. For single-master the serverID must be zero.

This is not how I read either the doc or the source. But if it were like this, it would be what I need :-) To enforce it, syncprov must be changed so that:

- If serverID is 0, it should only allow one contextCSN value, and that value should have 0 in the sid field. This may not strictly need to be enforced, but it should help to quickly identify incorrectly configured servers.

- If serverID is not 0, it should not accept contextCSN values from syncrepl with 0 in the sid field, to make sure it doesn't receive updates from a server configured as single-master.

- If serverID is not 0, it must ignore contextCSN values with 0 in the sid field read from the database. This is to allow a single-master server to be promoted to a multi-master without leaving the old sid=0 csn around forever. Hmm, if a csn with sid=0 is found, but none with the serverID value, then maybe it would be better to replace the sid in that csn? More hmm, when starting up it would probably be correct to count entries with 0 in the sid field of their entryCSN value among those that could cause the current server's contextCSN to be updated? I expect I'm not the only one who forgets to add the -S argument to slapadd...

Any existing mirrormode/multimaster configuration that uses 0 as its serverID must be changed, but this should be all that is needed when upgrading to this version.

What would be the correct action if a contextCSN with an invalid sid value is received from syncrepl? Asserting could be a bit too strict; better to ignore the value and complain loudly in the logs?

Does this make any sense? If so, I'll volunteer to implement.
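
The rules proposed above could be sketched roughly as follows. This is only an illustration of the proposal, not code from syncprov or syncrepl: csn_sid() and csn_acceptable() are hypothetical names, the CSN layout assumed is "timestamp#count#sid#mod" with a hexadecimal sid field, and malformed values are simply rejected so that the caller can ignore them and log a complaint.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Extract the SID (third '#'-separated field, hexadecimal) from a CSN
 * string such as "20080917122900.000000Z#000000#001#000000".
 * Returns -1 if the string does not have that shape. */
int csn_sid( const char *csn )
{
	const char *p;
	char *end;
	long sid;

	assert( csn != NULL );
	p = strchr( csn, '#' );		/* separator before the change count */
	if ( p == NULL ) return -1;
	p = strchr( p + 1, '#' );	/* separator before the SID */
	if ( p == NULL ) return -1;
	sid = strtol( p + 1, &end, 16 );
	if ( end == p + 1 || *end != '#' || sid < 0 || sid > 0xfff )
		return -1;
	return (int)sid;
}

/* Proposed rule, simplified to a single check:
 *   serverID == 0 (single master): only sid 0 is expected;
 *   serverID != 0 (multi-master):  sid 0 is not accepted.
 * Returns 1 to accept, 0 to ignore (and complain in the logs). */
int csn_acceptable( int server_id, const char *csn )
{
	int sid = csn_sid( csn );

	if ( sid < 0 ) return 0;	/* malformed value */
	if ( server_id == 0 ) return sid == 0;
	return sid != 0;
}
```

Returning a flag instead of asserting matches the suggestion to ignore bad values and log them; what the caller does next (give up replication, force a refresh) is left open here.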

Rein

Pierangelo Masarati

12:46 p.m.

Rein Tollevik wrote:

...

...
Well, a serverID of 0 is basically the same as no serverID. For mirrormode/multimaster the serverID must be non-zero. For single-master the serverID must be zero.

This is not how I read either the doc or the source. But if it were like this, it would be what I need :-) To enforce it, syncprov must be changed so that:

- If serverID is 0, it should only allow one contextCSN value, and that value should have 0 in the sid field. This may not strictly need to be enforced, but it should help to quickly identify incorrectly configured servers.

- If serverID is not 0, it should not accept contextCSN values from syncrepl with 0 in the sid field, to make sure it doesn't receive updates from a server configured as single-master.

- If serverID is not 0, it must ignore contextCSN values with 0 in the sid field read from the database. This is to allow a single-master server to be promoted to a multi-master without leaving the old sid=0 csn around forever. Hmm, if a csn with sid=0 is found, but none with the serverID value, then maybe it would be better to replace the sid in that csn? More hmm, when starting up it would probably be correct to count entries with 0 in the sid field of their entryCSN value among those that could cause the current server's contextCSN to be updated? I expect I'm not the only one who forgets to add the -S argument to slapadd...

Any existing mirrormode/multimaster configuration that uses 0 as its serverID must be changed, but this should be all that is needed when upgrading to this version.

What would be the correct action if a contextCSN with an invalid sid value is received from syncrepl? Asserting could be a bit too strict; better to ignore the value and complain loudly in the logs?

Does this make any sense? If so, I'll volunteer to implement.

To me, it makes a lot of sense and, if well explained in the docs, it would greatly help troubleshooting (or, even better, help people set things up the right way from the start).

My concerns are:

- do we need to consider all those cases and try to repair them? I'd say: no. Just complain (and refuse to start) if the problem can be solved by running "slapadd -S <SID>" or "slapcat | sed | slapadd".

- the problem should not occur at run time in a homogeneous, well-configured system (== same versions, consistent configuration). If it happens, just give up replication and/or commence a full refresh (I agree that assert'ing would be bad).

- slapadd could detect from the configuration whether -S is needed (I don't think it could determine the right SID, but at least it could complain, and require a --force option (to be implemented) if one believes he knows what he's doing).

Ing. Pierangelo Masarati OpenLDAP Core Team

Gavin Henry

2:39 p.m.

...

say: no. Just complain (and refuse to start) if the problem can be solved by running "slapadd -S <SID>" or "slapcat | sed | slapadd".

- the problem should not occur at run time in a homogeneous, well-configured system (== same versions, consistent configuration). If it happens, just give up replication and/or commence a full refresh (I agree that assert'ing would be bad).

- slapadd could detect from the configuration whether -S is needed (I don't think it could determine the right SID, but at least it could complain, and require a --force option (to be implemented) if one believes he knows what he's doing).

Can I confirm the use case here? I've not used the -S option and it sounds very important. According to Ando it should be clearly documented too.

Is it used in a MM/N-Way when exporting via slapcat and then importing to another server that will have its own serverID, hence the -S to override the currently exported serverID from the first server?

Thanks.

-- Kind Regards, Gavin Henry. OpenLDAP Engineering Team. E ghenry@OpenLDAP.org Community developed LDAP software. http://www.openldap.org/project/

Pierangelo Masarati

2:49 p.m.

Gavin Henry wrote:

...

Can I confirm the use case here? I've not used the -S option and it sounds very important. According to Ando it should be clearly documented too.

Is it used in a MM/N-Way when exporting via slapcat and then importing to another server that will have its own serverID, hence the -S to override the currently exported serverID from the first server?

As far as I know, you don't need it unless you're initializing a MM/N-Way from a clean LDIF (i.e. one without entryCSN). Usually, when you restore from a backup, you want the existing entryCSN values to be preserved. -S only affects the SID portion of entryCSN values *generated* by slapadd, i.e. those that were missing in the source LDIF. I added that option some time ago, when I needed to generate a database for an N-Way by importing an LDIF obtained from SunONE. The procedure then was:

- slapadd -w -S 001 -l plain.ldif

- slapcat -l full.ldif

- scp full.ldif other-n-way:

on other-n-way:

- slapadd -l full.ldif

This way, all N-Way servers would get the same database with the SID of the first one, "001". As an alternative, I could have fired up the first one and let the others sync.
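
To make the "-S only affects generated entryCSN" point concrete, here is a small sketch. It is not slapadd code: import_entry_csn() is a hypothetical helper, the change count and modification number are fixed at zero, and the CSN layout is assumed to be "YYYYmmddHHMMSS.uuuuuuZ#count#sid#mod" with a three-digit hexadecimal sid.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define CSN_BUFSIZE 64

/* Return the entryCSN an import would keep for one entry: the value
 * already present in the LDIF if there is one, otherwise a generated
 * value carrying the given SID. `when` is a parameter so that the
 * sketch stays deterministic. */
const char *import_entry_csn( const char *ldif_csn, unsigned sid,
	time_t when, char *buf, size_t buflen )
{
	const struct tm *tm;

	if ( ldif_csn != NULL )
		return ldif_csn;	/* preserved as-is; its SID is untouched */

	assert( buf != NULL && sid <= 0xfffu );
	tm = gmtime( &when );
	assert( tm != NULL );
	snprintf( buf, buflen,
		"%04d%02d%02d%02d%02d%02d.%06uZ#%06x#%03x#%06x",
		tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
		tm->tm_hour, tm->tm_min, tm->tm_sec,
		0u, 0u, sid, 0u );
	return buf;
}
```

An entry that already carries an entryCSN keeps it whatever SID is passed in, which is why a full LDIF produced by slapcat can be loaded on the other servers without -S.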

Ing. Pierangelo Masarati OpenLDAP Core Team

Gavin Henry

2:55 p.m.

----- "Pierangelo Masarati" ando@sys-net.it wrote:

...

Gavin Henry wrote:

...

Can I confirm the use case here? I've not used the -S option and it sounds very important. According to Ando it should be clearly documented too.

Is it used in a MM/N-Way when exporting via slapcat and then importing to another server that will have its own serverID, hence the -S to override the currently exported serverID from the first server?

As far as I know, you don't need it unless you're initializing a MM/N-Way from a clean LDIF (i.e. without entryCSN). Usually, when you restore from a backup, you want existing entryCSN to be preserved. -S only affects the SID portion of entryCSN *generated* by slapadd, i.e. those that were missing in the source LDIF. I added that option some time ago, when I needed to generate a database for a N-Way by importing an LDIF obtained from SunONE. The procedure then was:

- slapadd -w -S 001 -l plain.ldif

- slapcat -l full.ldif

- scp full.ldif other-n-way:

on other-n-way:

- slapadd -l full.ldif

OK, that's perfectly clear.

...

This way, all N-Way would get the same database with the SID of the first one, "001". As an alternative, I could have fired up the first one and let the others sync.

Yes, the latter way is what most folks do, except for those who have massive data sets and can't/won't sync online.

Thanks for clearing that up.

Gavin.

-- Kind Regards, Gavin Henry. OpenLDAP Engineering Team. E ghenry@OpenLDAP.org Community developed LDAP software. http://www.openldap.org/project/


openldap-bugs@openldap.org
