On Fri, 2007-09-28 at 17:02 -0700, Howard Chu wrote:
Stelios Grigoriadis wrote:
I have upgraded OpenLDAP to the latest stable version (2.3.38) and used Berkeley DB version 4.5.20. The problem remains. I realize my analysis wasn't correct since, as Howard Chu pointed out, the CSN contains both a timestamp and a counter. So the entryCSNs are unique.
Still, I have no idea why this happens. I suspect that the problem is in the initial phase of the sync operation (the refresh stage). It might be that some of the not-yet-committed modifications don't make it into the result set of the search operation. Later, after another entry is added, the "lost" entries are never synced over.
This also cannot be the cause. The contextCSN is snapshotted at the beginning of a refresh. Only updates between the consumer's cookie CSN and the snapshot CSN are sent to the consumer. Any entries added during this refresh will be excluded from the update, and the consumer will then record the snapshot CSN. Any entries the consumer didn't pick up in this refresh pass will be picked up in the next refresh.
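The refresh window described above can be modelled roughly as follows. This is only a toy sketch in Python, not slapd code: the `Provider` class, its methods, and the use of plain integers as CSNs are all assumptions made for illustration (a real CSN also carries a timestamp and other fields).

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Toy provider: maps each DN to the CSN of its last change (hypothetical model)."""
    entries: dict = field(default_factory=dict)
    context_csn: int = 0  # assumption: CSNs simplified to increasing integers

    def add(self, dn):
        # Every committed write gets a new, strictly larger CSN.
        self.context_csn += 1
        self.entries[dn] = self.context_csn

    def refresh(self, cookie_csn, during_refresh=None):
        # Snapshot the contextCSN before searching.
        snapshot = self.context_csn
        # Simulate writes that arrive while the refresh is running.
        if during_refresh is not None:
            during_refresh()
        # Send only changes with cookie CSN < entry CSN <= snapshot CSN.
        sent = sorted(
            (dn for dn, csn in self.entries.items() if cookie_csn < csn <= snapshot),
            key=lambda dn: self.entries[dn],
        )
        # The consumer records the snapshot CSN as its new cookie.
        return sent, snapshot


provider = Provider()
provider.add("cn=a")
provider.add("cn=b")

# "cn=c" is added during the first refresh, so it falls outside the window.
first, cookie = provider.refresh(0, during_refresh=lambda: provider.add("cn=c"))

# The next refresh starts from the recorded cookie and picks it up.
second, cookie2 = provider.refresh(cookie)
```

In this model an entry written during a refresh is delayed, not lost, because its CSN is greater than the cookie the consumer stores.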
I agree with you; I just didn't see the "next refresh" in the code. I thought it refreshed only once and that the master would then write back all subsequent changes (syncprov_op_response -> syncprov_qstart etc.).
I will test some more and try to provide more information. I have a test program that reproduces this problem, but it is a little cumbersome. I will try to slim it down and use more common schema elements before posting it.
That will certainly help.
The setup to reproduce the error is as follows: 1 master, 3 replicas.
1. Start the replicas.
2. Start the program that adds persons (parallell_stress_simple.sh). This is actually a script that starts a number of processes (add_person.c) on different machines, each of which adds persons.
3. Start the master.
4. When the script completes, compare the number of added entries in the master and replicas.
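For the comparison in step 4, something like the following could be used on LDIF dumps taken from each server. It is a minimal sketch in Python, not part of the test program: the function names are made up here, and it assumes plain unfolded "dn: " lines (no base64 "dn:: " values) and treats DNs as equal after simple lowercasing.

```python
def dns_from_ldif(text):
    """Collect the DNs found in an LDIF dump (assumes unfolded 'dn: ' lines)."""
    dns = set()
    for line in text.splitlines():
        if line.lower().startswith("dn: "):
            # Crude normalization: trim whitespace and lowercase the DN.
            dns.add(line[4:].strip().lower())
    return dns


def missing_on_replicas(master_dns, replicas):
    """For each replica name, list the DNs that the master has and the replica lacks."""
    return {name: sorted(master_dns - dns) for name, dns in replicas.items()}


# Made-up example data, only to show the shape of the result.
master = dns_from_ldif("dn: cn=p1,dc=example\n\ndn: cn=p2,dc=example\n")
replica1 = dns_from_ldif("dn: cn=p1,dc=example\n")
report = missing_on_replicas(master, {"replica1": replica1, "replica2": master})
```

Listing the missing DNs rather than only comparing counts shows which entries were not synced over.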
To Quanah Gibson-Mount: The slapd.conf is also provided.
/Stelios