OpenLDAP 2.3.32, Berkeley DB 4.6, gcc 4.1.0
Replication doesn't work if the master server is started after the replica servers and a large number of simultaneous updates are performed while the master is starting up.
The entries that didn't get replicated to the replicas will not be replicated even after a restart of both master and replicas. The contextCSN is set to a value larger than the entryCSN of the "lost" entries.
This is what I think happens during a master server startup with simultaneous updates ongoing (and replicas trying to sync in the initial phase).
Suppose that two clients (Client1 and Client2) are adding the entries a and b respectively. If both adds happen between t1 and t2 (one second apart), they get the same entryCSN (same timestamp). If entry a is committed at tc1 and b at tc2, any replica search in between will return only entry a. Entry b will be lost.
Client1 entry=a, csn=x
Client2 entry=b, csn=x
Timeline
------+----------+---------+----+------>
      |                         |
      t1         |         |    t2=t1+1
                 |         |
            tc1=entry a  tc2=entry b
            committed    committed
Replica search query between tc1 and tc2.
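To make the suspected sequence concrete, here is a toy model in Python. It is only a sketch of my hypothesis, not OpenLDAP code: the Master and Replica classes, the integer CSNs, and the assumption that a replica asks only for entries with a CSN strictly greater than its saved cookie are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical master: maps entry name -> CSN of committed entries."""
    committed: dict = field(default_factory=dict)

    def commit(self, name, csn):
        self.committed[name] = csn

    def search_newer_than(self, cookie):
        # Assumed replica filter: only entries whose CSN is above the cookie.
        return {n: c for n, c in self.committed.items() if c > cookie}


@dataclass
class Replica:
    """Hypothetical replica: remembers the highest CSN it has seen."""
    cookie: int = 0
    entries: dict = field(default_factory=dict)

    def sync(self, master):
        found = master.search_newer_than(self.cookie)
        self.entries.update(found)
        if found:
            self.cookie = max(found.values())


def simulate():
    x = 100                  # a and b share one second-granularity CSN
    master, replica = Master(), Replica()
    master.commit("a", x)    # tc1: entry a becomes visible
    replica.sync(master)     # replica search between tc1 and tc2
    master.commit("b", x)    # tc2: entry b becomes visible
    replica.sync(master)     # later search: b's CSN is not above the cookie
    return sorted(replica.entries), replica.cookie
```

In this model the replica ends up holding only entry a while its cookie already equals x, so no later search ever picks up entry b.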
I don't know whether a finer timestamp granularity would prevent this or, even better, some kind of counter so that every modification gets a unique CSN.
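A minimal sketch of the counter idea, in Python. The CsnGenerator class and the (second, counter) tuple format are hypothetical and are not meant to reflect how OpenLDAP actually builds its CSNs. The counter restarts at zero whenever the second advances and otherwise increments, so two modifications in the same second never share a value.

```python
import threading
import time


class CsnGenerator:
    """Hypothetical generator of unique, increasing (second, counter) CSNs."""

    def __init__(self, clock=time.time):
        self._clock = clock            # injectable clock, handy for testing
        self._lock = threading.Lock()  # serialize concurrent writers
        self._last_second = -1
        self._counter = 0

    def next(self):
        with self._lock:
            now = int(self._clock())
            if now > self._last_second:
                # New second: restart the counter.
                self._last_second = now
                self._counter = 0
            else:
                # Same second (or clock stepped back): bump the counter.
                self._counter += 1
            return (self._last_second, self._counter)
```

Even with unique values, I suspect an entry could still be missed if it is committed after another entry that carries a larger CSN, so the order in which entries become visible may matter as much as uniqueness.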
Can you please comment on our analysis and let us know whether it is correct or whether we have missed something important?
Any help or hints on how to avoid or fix this problem would be greatly appreciated.
If I receive useful information directly by private email, I will post a summary.
Regards
Stelios Grigoriadis