Joachim Hergeth wrote:
The initial synchronization of the consumer works as expected. All LDAP entries are copied to the consumer directory. But after some time, usually when users log in into the Samba running with the provider LDAP, nearly 50% of all LDAP entries on the consumer are deleted. This happens without any change on the provider LDAP!
I am experiencing a similar problem, database entries being deleted when they shouldn't be, on a number of replicas. This is causing disruption to users and we'd really like to get it cleared up.
Here's the environment:
- OpenLDAP 2.3.32 running on Debian 3.1 (Sarge)
- compiled with sync logging patch discussed about 4 months ago
- loglevel config sync on all servers
- BDB 4.2 backend
- Syncrepl replication all round
- A "master" server (com)
  - holds the master copy of the database
- A number of servers that replicate directly from com
- An "intermediate" server (wwsv04) that
  - is on the same LAN and subnet as com
  - replicates from com
  - acts as provider for all other servers
- 88 servers/replicas in total
- Approx 9000 records
- All replicas are supposed to be complete copies
- Nothing particularly fancy or clever going on
I have spent today dissecting the logs from two incidents this week in which entries were erroneously deleted. Although the circumstances of the two incidents are quite different, from examining the logs I believe it is the same thing happening in each case.
Incident 1: A test server ran out of space in /var. After cleaning up, rebooting and doing db_recover, 244 entries were erroneously deleted. This server is on the same building network as com and wwsv04 (different subnet), and replicates from wwsv04.
Incident 2: A production server on the other side of the world lost contact with its provider, and after reconnecting, erroneously deleted 249 entries. This server is connected to the main network via an OpenVPN tunnel across the internet, and replicates from com. (The really odd thing about this is that the network link came back up over an hour before the first timestamp of this incident, but that is most likely a separate issue.)
I have attached three files (I hope they're not too big):
- log_analysis.ods - an OpenOffice spreadsheet containing correlated log entries and my comments for both incidents
- incident_1.tgz - syslogs relating to incident 1, and a CSV version of my analysis of incident 1
- incident_2.tgz - likewise for incident 2
Replication configs are as follows:

-------------------------------------
com
------
# Syncrepl provider

overlay syncprov
syncprov-checkpoint 10 5
syncprov-sessionlog 100
syncprov-reloadhint TRUE

-------------------------------------
wwsv04
------
# Syncrepl provider

overlay syncprov
syncprov-checkpoint 10 5
syncprov-sessionlog 100
syncprov-reloadhint TRUE

# syncrepl consumer

syncrepl rid=123
        provider=ldap://com.example.co.nz
        type=refreshAndPersist
        searchbase="dc=example,dc=co,dc=nz"
        scope=sub
        schemachecking=off
        bindmethod=simple
        binddn="cn=root,dc=example,dc=co,dc=nz"
        credentials=secret
        retry=5,5,30,5,60,5,300,+

-------------------------------------
zzsv01
------
syncrepl rid=123
        provider=ldap://wwsv04.example.co.nz
        type=refreshAndPersist
        searchbase="dc=example,dc=co,dc=nz"
        scope=sub
        schemachecking=off
        bindmethod=simple
        binddn="cn=root,dc=example,dc=co,dc=nz"
        credentials=secret
        retry=5,5,30,5,60,5,300,+

-------------------------------------
nosv01
------
syncrepl rid=123
        provider=ldap://com.example.co.nz
        type=refreshAndPersist
        searchbase="dc=example,dc=co,dc=nz"
        scope=sub
        schemachecking=off
        bindmethod=simple
        binddn="cn=root,dc=example,dc=co,dc=nz"
        credentials=secret
        retry=5,5,30,5,60,5,300,+

-------------------------------------
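For anyone lining up the reconnect timestamps in the logs against the consumer config: as I understand it, the retry value is a list of <interval in seconds>,<number of retries> pairs, with "+" meaning retry indefinitely. Below is a rough Python sketch that expands the retry string used above into its stages. The function names parse_retry and seconds_before_final_stage are my own, purely for illustration, and the total ignores the time each connection attempt itself takes.

```python
def parse_retry(spec):
    """Split a syncrepl retry string into (interval_seconds, count) pairs.

    count is None where the config uses "+" (retry indefinitely).
    Illustrative helper only; not part of OpenLDAP.
    """
    parts = [p.strip() for p in spec.split(",")]
    if len(parts) % 2 != 0:
        raise ValueError("retry spec must be interval,count pairs: %r" % spec)
    pairs = []
    for i in range(0, len(parts), 2):
        interval = int(parts[i])
        count = None if parts[i + 1] == "+" else int(parts[i + 1])
        pairs.append((interval, count))
    return pairs


def seconds_before_final_stage(pairs):
    """Sum the waiting time of all finite stages (connection time ignored)."""
    return sum(interval * count for interval, count in pairs if count is not None)


if __name__ == "__main__":
    stages = parse_retry("5,5,30,5,60,5,300,+")
    for interval, count in stages:
        times = "indefinitely" if count is None else "%d times" % count
        print("retry every %d s, %s" % (interval, times))
    print("finite stages last about %d s" % seconds_before_final_stage(stages))
```

With the retry string above, the finite stages add up to roughly 475 seconds of waiting, after which a consumer should be retrying every 300 seconds. That is part of why a gap of over an hour between the link coming back and the first log entry looks odd.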