Burton, Kris - Acision wrote:
All,
I want to ask the list about this before I open an ITS, to make sure I am understanding everything correctly. We are running OpenLDAP 2.4.11. I selectively backported ITS#5709 to our source because we were losing replications; applying it seemed to help and reduced the number of lost replications. We are running in MirrorMode using refreshAndPersist, doing a high volume of adds to the master, on the order of 100/s. We have run numerous iterations of the same test with very aggressive NTP updates keeping the master and consumer within 50 microseconds of one another, which I saw recommended as a possible solution in a previous thread. That made little to no difference in the replication loss.
If you're actually using MirrorMode, with all writes going to only one server, then NTP doesn't really matter. The time synchronization is only important when reconciling concurrent updates that occurred on different servers. I.e., it only matters when you're running multimaster (as opposed to MirrorMode), and for reconciling any updates that occurred while a MirrorMode failover was in progress. From the sounds of it, your test doesn't meet either of these criteria.
From looking at the code, I suspected the lost replications were due to entries being queued on the master side in non-ascending CSN order, which I was seeing immediately before each replication that the consumer rejected. My theory was that the logic that traverses the queue to mark committed CSNs and update the contextCSN was getting out of sync because of this, orphaning replications that were still pending: the consumer treats them as too old, even though in reality it never received them.
Looking at your debug info, this sounds likely. Yes, please submit this info to the ITS.
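For anyone following along, the failure mode described above can be sketched roughly like this. This is a simplified model, not the actual slapd code; the function names and the integer "CSNs" are illustrative only:

```python
# Simplified model (NOT actual slapd code) of how out-of-order queuing of
# pending CSNs on the provider can orphan a replication on the consumer.

def advance_context_csn(queue, context_csn):
    """Walk the pending queue in *queue order*, advancing contextCSN over
    the leading run of committed entries. If the queue is not in ascending
    CSN order, contextCSN can jump past a smaller CSN still pending."""
    while queue and queue[0]["committed"]:
        entry = queue.pop(0)
        context_csn = max(context_csn, entry["csn"])
    return context_csn

def consumer_accepts(update_csn, consumer_cookie_csn):
    # A syncrepl consumer rejects any update whose CSN is not newer than
    # its stored cookie CSN -- the "CSN too old" message.
    return update_csn > consumer_cookie_csn

# Entries queued on the provider in NON-ascending CSN order:
queue = [
    {"csn": 102, "committed": True},   # queued first, but has the newer CSN
    {"csn": 101, "committed": False},  # older CSN, commit still pending
]

context_csn = advance_context_csn(queue, context_csn=100)
print(context_csn)  # contextCSN has advanced to 102

# The consumer syncs its cookie up to the provider's contextCSN...
cookie = context_csn

# ...so when CSN 101 finally commits and is replicated, it is rejected
# even though the consumer never actually applied it:
queue[0]["committed"] = True
print(consumer_accepts(queue[0]["csn"], cookie))  # False -> "CSN too old"
```

If entries were instead queued in ascending CSN order, the walk would stop at the first uncommitted entry and contextCSN could never overtake a pending write, which is the invariant the real queue-traversal logic appears to rely on.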
I just pulled the latest code from RE24 and reran the test. The latest code is better than 2.4.11 with just the backport of ITS#5709, but we are still losing a small percentage of the replications with the "CSN too old" message. With the latest code I am still seeing a correlation between the out-of-order queuing on the master and the replications that are rejected on the consumer.