Full_Name: Kevan Carstensen
Version: 2.4.28
OS: gentoo linux
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (134.71.248.23)
During a heavy update load, replication from our master to one of our read-only LDAP replicas started failing due to locking errors on the replica:
Dec 14 04:01:08 pip-dev slapd[12645]: bdb(dc=csupomona,dc=edu): Lock table is out of available lock entries
Dec 14 04:01:08 pip-dev slapd[12645]: => bdb_idl_delete_key: c_get failed: Cannot allocate memory (12)
Dec 14 04:01:08 pip-dev slapd[12645]: conn=-1 op=0: attribute "memberUid" index delete failure
Dec 14 04:01:08 pip-dev slapd[12645]: null_callback : error code 0x50
Dec 14 04:01:08 pip-dev slapd[12645]: syncrepl_entry: rid=001 be_modify failed (80)
Dec 14 04:01:08 pip-dev slapd[12645]: do_syncrepl: rid=001 rc 80 retrying
(These errors are due to an incorrect DB_CONFIG on the replica and are not related to this issue; I include them in case their impact on how/when the replica connects to the master is important.)
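For completeness, the replica-side failure is the usual Berkeley DB lock-table exhaustion, which is normally fixed by raising the lock limits in the replica's DB_CONFIG. The following is only an illustrative sketch, not our actual configuration; the values are placeholders and would need sizing against the real database and index load:

    # DB_CONFIG sketch (illustrative values, not our settings):
    # enlarge the lock tables so a large syncrepl refresh cannot
    # exhaust them mid-transaction
    set_lk_max_locks    10000
    set_lk_max_objects  10000
    set_lk_max_lockers  1000

DB_CONFIG is read when the database environment is opened, so changes like these generally take effect on the next slapd start.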
After about ten minutes of the replica trying and failing to start its syncrepl again, the master slapd crashed. Nothing in the master slapd logs specifically references the crash or indicates why it happened. Shortly after the replica broke, some unusual messages started showing up in the logs on the master:
Dec 14 03:50:11 fosse slapd[22596]: connection_read(65): no connection!
Dec 14 03:50:22 fosse slapd[22596]: send_search_entry: conn 7535053 ber write failed.
Dec 14 03:50:32 fosse slapd[22596]: send_search_entry: conn 7535061 ber write failed.
Dec 14 03:50:42 fosse slapd[22596]: send_search_entry: conn 7535065 ber write failed.
Dec 14 03:50:56 fosse slapd[22596]: send_search_entry: conn 7535068 ber write failed.
Dec 14 03:51:06 fosse slapd[22596]: connection_read(65): no connection!
Dec 14 03:51:06 fosse slapd[22596]: connection_read(65): no connection!
These messages continued until the crash, persisted after the master slapd was successfully restarted, and went away when I turned the broken replica off.
I reproduced similar behavior in a development setup modeled on our production setup. There is one master, and there are two read-only replicas configured to replicate from the master with the following syncrepl stanza:
syncrepl rid=1
        provider=ldaps://master-2.ldap.csupomona.edu/
        type=refreshAndPersist
        retry="10 10 60 +"
        searchbase="dc=csupomona,dc=edu"
        bindmethod=simple
        binddn=cn=slapd-syncrepl,ou=user,ou=service,dc=csupomona,dc=edu
        credentials=__PASSWORD__
(There are two replicas because I couldn't reliably reproduce a crash with only one.)
The replicas have a DB_CONFIG that causes their replication attempts to fail with the same error we saw in our production setup. The master database is initialized so that the replicas have a lot of changes to catch up on, as was the case when our production setup failed. Even with loglevel -1, there's nothing specifically referencing the crash in the master slapd's logs. The backtrace shows an assertion failure:
slapd: daemon.c:942: slapd_clr_write: Assertion `((slap_daemon[id].sd_index[(s)]) != -1)' failed.
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffea863700 (LWP 27623)]
0x00007ffff6a95175 in raise () from /lib64/libc.so.6
(gdb) bt full
#0  0x00007ffff6a95175 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007ffff6a96590 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007ffff6a8e201 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3  0x000000000042ec9b in slapd_clr_write (s=<value optimized out>, wake=0) at daemon.c:942
        id = 0
        __PRETTY_FUNCTION__ = "slapd_clr_write"
#4  0x000000000043419b in connection_write (s=17) at connection.c:1896
        c = <value optimized out>
        op = <value optimized out>
        __PRETTY_FUNCTION__ = "connection_write"
#5  0x000000000043006a in slapd_daemon_task (ptr=<value optimized out>) at daemon.c:2782
        rc = 1
        fd = 17
        w = 6
        ns = 1
        at = 0
        revents = <value optimized out>
        tvp = <value optimized out>
        cat = {tv_sec = 1324320980, tv_usec = 0}
        i = 0
        now = <value optimized out>
        tv = {tv_sec = 1135, tv_usec = 0}
        rtask = <value optimized out>
        l = <value optimized out>
        last_idle_check = 1324317380
        ebadf = 0
        tid = <value optimized out>
#6  0x00007ffff71ed914 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7  0x00007ffff6b347dd in clone () from /lib64/libc.so.6
No symbol table info available.
The master and replicas in my development setup all run 2.4.28; the production master runs 2.4.26. I reproduced the issue on both 2.4.26 and 2.4.28, but only have a backtrace for 2.4.28. If you want a backtrace for 2.4.26, I can get one.
Thanks.