Hi,
I've loaded my mirror mode setup with data and let it run for a few day, Both cn=config and the application database is mirrored. Only server1 is receiving writes from the application.
OpenLDAP 2.4.20, BDB 4.8
After about 6 hours the mirror partly broke and I experience 3 symptoms:
1) The syncrepl connection from server1->server2 for the application database is missing and data only flows from server1 to server2 - not the other way. The cn=config connections exists.
$ netstat -tna # shows tcp 0 0 192.168.0.102:636 0.0.0.0:* LISTEN tcp 8125 0 192.168.0.102:45535 192.168.0.101:636 ESTABLISHED tcp 0 0 192.168.0.102:636 192.168.0.101:34954 ESTABLISHED tcp 0 0 192.168.0.102:45537 192.168.0.101:636 ESTABLISHED
Where it should show, something like: tcp 0 0 192.168.0.101:636 0.0.0.0:* LISTEN tcp 0 0 192.168.0.101:34954 192.168.0.102:636 ESTABLISHED tcp 261 0 192.168.0.101:33409 192.168.0.102:636 ESTABLISHED tcp 0 0 192.168.0.101:636 192.168.0.102:45537 ESTABLISHED tcp 0 0 192.168.0.101:636 192.168.0.102:33226 ESTABLISHED
2) Meanwhile the log on server1 says: Dec 8 02:04:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -1 retrying Dec 8 02:05:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -2 retrying Dec 8 02:06:03 server1 slapd[6863]: do_syncrepl: rid=004 rc -2 retrying etc...
The first such entry appear around 6 hours after start of the mirror.
3) If I try to change cn=config with ldapmodify on either server, server1 will hang, not answering queries until I restart it. For instance, if I do: ---------- dn: cn=config changetype: modify replace: olcLogLevel olcLogLevel: None sync ----------- ... it'l hang.
I was able to connect and search the database on both server, to both servers like (on server1), using client certs: ldapsearch -H ldaps://server2/ -YEXTERNAL -b cn=data,dc=example,dc=com -s sub -D cn=config '(cn=*)' + *
So it's not that the TCP connection can't be established. Which make me suspect that this is related to this thread: http://www.mail-archive.com/openldap-software@openldap.org/msg16028.html
Now after 27 hours the connection finally came back by it self, and replication works both ways. The "rc -2 retrying" in the log on server1 stopped and was replaced by:
Dec 8 15:39:34 server1 slapd[11177]: do_syncrepl: rid=004 rc -2 retrying Dec 8 15:40:34 server1 slapd[11177]: do_syncrepl: rid=004 rc -2 retrying Dec 8 15:42:15 server1 slapd[11177]: => bdb_idl_insert_key: c_put id failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994) Dec 8 15:47:05 server1 slapd[11177]: => bdb_idl_delete_key: c_del id failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994) Dec 8 15:47:05 server1 slapd[11177]: conn=15694 op=16: attribute "entryCSN" index delete failure Dec 8 15:47:06 server1 slapd[11177]: => bdb_idl_delete_key: c_del id failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994) Dec 8 15:47:06 server1 slapd[11177]: conn=15569 op=36: attribute "entryCSN" index delete failure ... and a bit more of the same.
Trying to modify cn=config with ldapmodify still makes server1 (and ldapmodify) hang though.
/Peter
Peter Mogensen wrote:
Trying to modify cn=config with ldapmodify still makes server1 (and ldapmodify) hang though.
Actually, trying to turn sync logging off this time made it hang so hard, that it wouldn't stop by SIGTERM. It only did: Dec 8 16:28:48 server1 slapd[29753]: daemon: shutdown requested and initiated. Dec 8 16:28:48 server1 slapd[29753]: slapd shutdown: waiting for 54 operations/tasks to finish
One thread was hanging in a read() on the FD for the connection to server2.
It was only when I also stopped server2, that server1 actually stopped.
/Peter
--On Tuesday, December 08, 2009 5:40 PM +0100 Peter Mogensen apm@mutex.dk wrote:
It was only when I also stopped server2, that server1 actually stopped.
I would of course, at this point, expect an ITS has been filed, and a fully detailed gdb backtrace submitted along with it.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
--On Tuesday, December 08, 2009 5:40 PM +0100 Peter Mogensen apm@mutex.dk wrote:
It was only when I also stopped server2, that server1 actually stopped.
I would of course, at this point, expect an ITS has been filed, and a fully detailed gdb backtrace submitted along with it.
I'll try to reproduce something better to post in an ITS. As I said, it doens actually crash, it just hangs. I can see with tcpdump, that server1 actually tries to connect once every minute to server2 and does establish an TLS connection, but after af few frames of application data it sends close notify. Another thing bothering me is that a few threads on server1 are using 99.9% CPU. (server2 is not accessed by clients, so it is mostly idle)
Is it possible to temporarily turn of mirroring of cn=config, so I can raise loglevels on server2 without the change being replicated to server1 and thus hanging the whole system ?
What happens if I just turn "mirrormode off" on server2 for a while?
/Peter
Peter Mogensen wrote:
Another thing bothering me is that a few threads on server1 are using 99.9% CPU.
Actually... it's not the same thread. There's a constant re-spawning of threads and many of them seem to end up user 99.9% CPU.
Can that be releated to the log-messages?:
Dec 10 12:12:11 server1 slapd[31416]: => bdb_idl_delete_key: c_del id failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994) Dec 10 12:12:11 server1 slapd[31416]: conn=140799 op=20: attribute "entryCSN" index delete failure
/Peter
On Thu, 10 Dec 2009 09:36:33 +0100, Peter Mogensen apm@mutex.dk wrote:
Quanah Gibson-Mount wrote:
--On Tuesday, December 08, 2009 5:40 PM +0100 Peter Mogensen apm@mutex.dk wrote:
It was only when I also stopped server2, that server1 actually stopped.
I would of course, at this point, expect an ITS has been filed, and a fully detailed gdb backtrace submitted along with it.
I'll try to reproduce something better to post in an ITS. As I said, it doens actually crash, it just hangs.
Remember you can use gdb to connect to a running process, using it's PID, with a syntax like: gdb /path/to/slapd PID
Then, something like "thread apply all bt".
I can see with tcpdump, that server1 actually tries to connect once every minute to server2 and does establish an TLS connection, but after af few frames of application data it sends close notify. Another thing bothering me is that a few threads on server1 are using 99.9% CPU. (server2 is not accessed by clients, so it is mostly idle)
Is it possible to temporarily turn of mirroring of cn=config, so I can raise loglevels on server2 without the change being replicated to server1 and thus hanging the whole system ?
Of course. However, according to your description of this problem, it seems to be related to replication. So turning replication off will likely interfere with your examination of the problem.
I recommend running slapd with the -d switch to see debug output (maybe redirecting it to a file). Using the monitor overlay may also be useful, to observe current connections on each server.
Hope this helps, Jonathan
Jonathan Clarke wrote:
Is it possible to temporarily turn of mirroring of cn=config, so I can raise loglevels on server2 without the change being replicated to server1 and thus hanging the whole system ?
Of course. However, according to your description of this problem, it seems to be related to replication. So turning replication off will likely interfere with your examination of the problem.
Yes, but I planned to only turn replication of for cn=config, not for the data replication, which is where the problem with lost connections is.
I recommend running slapd with the -d switch to see debug output (maybe redirecting it to a file). Using the monitor overlay may also be useful, to observe current connections on each server.
will try...
/Peter
On Thu, 10 Dec 2009 15:35:26 +0100, Peter Mogensen apm@mutex.dk wrote:
Jonathan Clarke wrote:
Is it possible to temporarily turn of mirroring of cn=config, so I can raise loglevels on server2 without the change being replicated to server1 and thus hanging the whole system ?
Of course. However, according to your description of this problem, it seems to be related to replication. So turning replication off will likely interfere with your examination of the problem.
Yes, but I planned to only turn replication of for cn=config, not for the data replication, which is where the problem with lost connections
is.
Ah, yes. Just remove the olcSyncrepl attribute in the configuration entry for that database then. You could do this on either server, the change should be replicated over to the other (then replication stops, of course).
Jonathan
openldap-technical@openldap.org