syncrepl broke, connection loss - openldap-technical

8 Dec 2009


Hi,
I've loaded my mirror mode setup with data and let it run for a few days.
Both cn=config and the application database are mirrored.
Only server1 is receiving writes from the application.
OpenLDAP 2.4.20, BDB 4.8
After about 6 hours the mirror partly broke, and I see three symptoms:
1)
The syncrepl connection from server1->server2 for the application 
database is missing, and data only flows from server1 to server2, not 
the other way. The cn=config connections exist.
$ netstat -tna # shows
tcp    0  0 192.168.0.102:636    0.0.0.0:*            LISTEN
tcp 8125  0 192.168.0.102:45535  192.168.0.101:636    ESTABLISHED
tcp    0  0 192.168.0.102:636    192.168.0.101:34954  ESTABLISHED
tcp    0  0 192.168.0.102:45537  192.168.0.101:636    ESTABLISHED
It should show something like:
tcp    0  0 192.168.0.101:636    0.0.0.0:*            LISTEN
tcp    0  0 192.168.0.101:34954  192.168.0.102:636    ESTABLISHED
tcp  261  0 192.168.0.101:33409  192.168.0.102:636    ESTABLISHED
tcp    0  0 192.168.0.101:636    192.168.0.102:45537  ESTABLISHED
tcp    0  0 192.168.0.101:636    192.168.0.102:33226  ESTABLISHED
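The comparison between the two listings can be scripted. What follows is only a minimal sketch, assuming that each syncrepl consumer holds one outbound connection to port 636 on the peer (which is what the listings suggest). The function name outbound_to_peer is made up for this illustration.

```shell
# Hypothetical helper: list ESTABLISHED outbound connections to the
# peer's LDAPS port. Reads `netstat -tna` output on stdin; $1 is the
# peer's IP address. Assumes the usual Linux column order:
# Proto Recv-Q Send-Q Local-Address Foreign-Address State
outbound_to_peer() {
    peer="$1"
    awk -v remote="$peer:636" '
        $1 ~ /^tcp/ && $5 == remote && $6 == "ESTABLISHED" {
            print $4 " -> " $5 " recvq=" $2
        }'
}

# Usage on server1 (not executed here):
#   netstat -tna | outbound_to_peer 192.168.0.102
```

If fewer lines are printed than there are consumers configured towards the peer, one of the syncrepl connections is probably missing. A persistently non-zero recvq may also be worth a look.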
2)
Meanwhile the log on server1 says:
Dec  8 02:04:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -1 retrying
Dec  8 02:05:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -2 retrying
Dec  8 02:06:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -2 retrying
etc...
The first such entry appeared around 6 hours after the mirror was started.
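To see at a glance which rid is retrying and with which return code, the log can be summarised. This is a rough sketch that assumes the syslog line format shown above; the function name syncrepl_retries is invented for the example, and the log file path in the usage comment is an assumption.

```shell
# Hypothetical helper: count "do_syncrepl: rid=NNN rc N retrying"
# lines per rid and return code. Reads syslog lines on stdin.
syncrepl_retries() {
    awk '/do_syncrepl: rid=[0-9]+ rc -?[0-9]+ retrying/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^rid=/) rid = $i       # e.g. rid=004
            if ($i == "rc")   rc  = $(i + 1) # e.g. -2
        }
        count[rid " rc=" rc]++
    }
    END { for (k in count) print k, count[k] }' | sort
}

# Usage (not executed here; the log path is an assumption):
#   syncrepl_retries < /var/log/syslog
```

Each output line has the form "rid=NNN rc=N count", which makes it easy to tell whether a single consumer or several are affected.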
3)
If I try to change cn=config with ldapmodify on either server, server1 
will hang, not answering queries until I restart it.
For instance, if I do:
----------
dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: None sync
-----------
... it'll hang.
I was able to connect to and search the database on both servers, from 
both servers, using client certs, for example (on server1):
ldapsearch -H ldaps://server2/ -YEXTERNAL -b cn=data,dc=example,dc=com 
-s sub -D cn=config '(cn=*)'  + *
So it's not that the TCP connection can't be established.
That makes me suspect that this is related to this thread:
http://www.mail-archive.com/openldap-software@openldap.org/msg16028.html
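Since both servers answer searches, one way to check whether they have actually converged is to compare the contextCSN values of the replicated database on each side. The sketch below assumes that cn=data,dc=example,dc=com is the suffix entry of that database (contextCSN is kept on the suffix entry); the function name csn_values and the temporary file names are made up.

```shell
# Hypothetical helper: extract contextCSN values from ldapsearch
# LDIF output on stdin and print them sorted, one per line.
csn_values() {
    awk -F': ' '$1 == "contextCSN" { print $2 }' | sort
}

# Usage (not executed here; the base DN is assumed to be the suffix):
#   ldapsearch -LLL -H ldaps://server1/ -Y EXTERNAL \
#       -b cn=data,dc=example,dc=com -s base contextCSN | csn_values > s1.csn
#   ldapsearch -LLL -H ldaps://server2/ -Y EXTERNAL \
#       -b cn=data,dc=example,dc=com -s base contextCSN | csn_values > s2.csn
#   diff s1.csn s2.csn && echo "contextCSN values match"
```

If the two sets differ and stay different while no writes are coming in, replication in at least one direction is likely stuck.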
Now, after 27 hours, the connection finally came back by itself, and 
replication works both ways.
The "rc -2 retrying" messages in the log on server1 stopped and were followed by:
Dec  8 15:39:34 server1 slapd[11177]: do_syncrepl: rid=004 rc -2 retrying
Dec  8 15:40:34 server1 slapd[11177]: do_syncrepl: rid=004 rc -2 retrying
Dec  8 15:42:15 server1 slapd[11177]: => bdb_idl_insert_key: c_put id 
failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
Dec  8 15:47:05 server1 slapd[11177]: => bdb_idl_delete_key: c_del id 
failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
Dec  8 15:47:05 server1 slapd[11177]: conn=15694 op=16: attribute 
"entryCSN" index delete failure
Dec  8 15:47:06 server1 slapd[11177]: => bdb_idl_delete_key: c_del id 
failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
Dec  8 15:47:06 server1 slapd[11177]: conn=15569 op=36: attribute 
"entryCSN" index delete failure
... and a bit more of the same.
Trying to modify cn=config with ldapmodify still makes server1 (and 
ldapmodify) hang, though.
/Peter