Hi,
I've loaded my mirror mode setup with data and let it run for a few day, Both cn=config and the application database is mirrored. Only server1 is receiving writes from the application.
OpenLDAP 2.4.20, BDB 4.8
After about 6 hours the mirror partly broke and I experience 3 symptoms:
1) The syncrepl connection from server1->server2 for the application database is missing and data only flows from server1 to server2 - not the other way. The cn=config connections exists.
$ netstat -tna # shows tcp 0 0 192.168.0.102:636 0.0.0.0:* LISTEN tcp 8125 0 192.168.0.102:45535 192.168.0.101:636 ESTABLISHED tcp 0 0 192.168.0.102:636 192.168.0.101:34954 ESTABLISHED tcp 0 0 192.168.0.102:45537 192.168.0.101:636 ESTABLISHED
Where it should show, something like: tcp 0 0 192.168.0.101:636 0.0.0.0:* LISTEN tcp 0 0 192.168.0.101:34954 192.168.0.102:636 ESTABLISHED tcp 261 0 192.168.0.101:33409 192.168.0.102:636 ESTABLISHED tcp 0 0 192.168.0.101:636 192.168.0.102:45537 ESTABLISHED tcp 0 0 192.168.0.101:636 192.168.0.102:33226 ESTABLISHED
2) Meanwhile the log on server1 says: Dec 8 02:04:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -1 retrying Dec 8 02:05:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -2 retrying Dec 8 02:06:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -2 retrying etc...
The first such entry appear around 6 hours after start of the mirror.
3) If I try to change cn=config with ldapmodify on either server, server1 will hang, not answering queries until I restart it. For instance, if I do: ---------- dn: cn=config changetype: modify replace: olcLogLevel olcLogLevel: None sync ----------- ... it'l hang.
I was able to connect and search the database on both server, to both servers like (on server1), using client certs: ldapsearch -H ldaps://server2/ -YEXTERNAL -b cn=data,dc=example,dc=com -s sub -D cn=config '(cn=*)' + *
So it's not that the TCP connection can't be established. Which make me suspect that this is related to this thread: http://www.mail-archive.com/openldap-software@openldap.org/msg16028.html
Now after 27 hours the connection finally came back by it self, and replication works both ways. The "rc -2 retrying" in the log on server1 stopped and was replaced by:
Dec 8 15:39:34 server1 slapd[11177]: do_syncrepl: rid=004 rc -2 retrying Dec 8 15:40:34 server1 slapd[11177]: do_syncrepl: rid=004 rc -2 retrying Dec 8 15:42:15 server1 slapd[11177]: => bdb_idl_insert_key: c_put id failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994) Dec 8 15:47:05 server1 slapd[11177]: => bdb_idl_delete_key: c_del id failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994) Dec 8 15:47:05 server1 slapd[11177]: conn=15694 op=16: attribute "entryCSN" index delete failure Dec 8 15:47:06 server1 slapd[11177]: => bdb_idl_delete_key: c_del id failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994) Dec 8 15:47:06 server1 slapd[11177]: conn=15569 op=36: attribute "entryCSN" index delete failure ... and a bit more of the same.
Trying to modify cn=config with ldapmodify still makes server1 (and ldapmodify) hang though.
/Peter