Rein Tollevik wrote:
Howard Chu wrote:
And unfortunately I had no time to do any more debugging until now; with St. Patrick's Day this Tuesday I had gigs all weekend. I also see that the test050 run I left overnight eventually crashed, and the symptoms are the same as in Quanah's. So, there's still more to track down.
Looks as if I might have hit the same; see the stack trace at the end.
For reference:
violino:~/OD/hobj/tests/testrun> grep rid=003 !$
grep rid=003 slapd.1.log
=>do_syncrepl rid=003
do_syncrepl: rid=003 retrying (9 retries left)
=>do_syncrepl rid=003
=>do_syncrep2 rid=003
=>do_syncrepl rid=003
=>do_syncrep2 rid=003
olcSyncrepl: {2}rid=003 provider=ldap://localhost:9013/ binddn="cn=config" bin
=>do_syncrepl rid=003
=>do_syncrep2 rid=003
olcSyncrepl: {2}rid=003 provider=ldap://localhost:9013/ binddn="cn=config" bin
=>do_syncrepl rid=003
=>do_syncrepl rid=003
=>do_syncrep2 rid=003
do_syncrepl: rid=003 quitting
The odd thing here, of course, is that it should never jump from '9 retries left' to 'quitting'; there should be at least 9 failures / retry messages in between. It seems like we have a wild memory overwrite occurring.
I assume it is quitting due to a config update. It looks to me as if syncinfo structures are released while still active.
OK. This must be occurring because a connection_client thread is in the thread pool but hasn't started running yet when the config change occurs. So the usual mutexes aren't held yet...
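One generic way to close a window like this can be sketched as follows: the queued task takes its own reference when it is submitted, so a config delete that arrives before the task starts can only mark the structure dead, not free it. This is a minimal illustration under stated assumptions, not slapd's actual code and not necessarily the fix that fits here; task_info, info_new, task_submit, config_delete, task_run and freed_count are all hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-consumer structure; NOT syncinfo_t. */
typedef struct task_info {
    atomic_int  refs;  /* one ref for the config, one per queued task */
    atomic_bool dead;  /* set when the config entry is deleted */
    int         rid;
} task_info;

int freed_count = 0;   /* illustration only: counts real frees */

task_info *info_new(int rid)
{
    task_info *ti = malloc(sizeof *ti);
    if (ti == NULL)
        return NULL;
    atomic_init(&ti->refs, 1);      /* the config's own reference */
    atomic_init(&ti->dead, false);
    ti->rid = rid;
    return ti;
}

/* Drop one reference; free only when the last holder lets go. */
void info_unref(task_info *ti)
{
    assert(ti != NULL);
    if (atomic_fetch_sub(&ti->refs, 1) == 1) {
        free(ti);
        freed_count++;
    }
}

/* Called when the task is queued, i.e. before any worker runs it. */
task_info *task_submit(task_info *ti)
{
    atomic_fetch_add(&ti->refs, 1);
    return ti;
}

/* Called by the config-change path: mark dead, drop the config's ref. */
void config_delete(task_info *ti)
{
    atomic_store(&ti->dead, true);
    info_unref(ti);
}

/* Called by the pool worker; returns true only if the work was done. */
bool task_run(task_info *arg)
{
    bool ran = !atomic_load(&arg->dead);
    /* the real work would go here, and only when ran is true */
    info_unref(arg);
    return ran;
}
```

With the ordering submit, then config_delete, then task_run, the structure is still valid when the worker finally starts; the worker sees the dead flag, skips the work, and the memory is released exactly once when it drops its reference.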
Rein
(gdb) where
#0  0x0000002a968d2540 in strlen () from /lib64/tls/libc.so.6
#1  0x0000002a968a4a0b in vfprintf () from /lib64/tls/libc.so.6
#2  0x0000002a968c4434 in vsnprintf () from /lib64/tls/libc.so.6
#3  0x0000002a958c3181 in lutil_debug (debug=<value optimized out>, level=<value optimized out>, fmt=0x448076c8 "$") at debug.c:66
#4  0x00000000004957d1 in do_syncrepl (ctx=0x44807e90, arg=0x858150) at syncrepl.c:1261
#5  0x0000002a9567e415 in ldap_int_thread_pool_wrapper (xpool=<value optimized out>) at tpool.c:663
#6  0x0000002a9675310a in start_thread () from /lib64/tls/libpthread.so.0
#7  0x0000002a969288b3 in clone () from /lib64/tls/libc.so.6
#8  0x0000000000000000 in ?? ()
(gdb) print si
$1 = (syncinfo_t *) 0x0
(gdb) print *rtask
$2 = {next_sched = {tv_sec = 7598733802573148208, tv_usec = 14422794207978861}, interval = {tv_sec = 384, tv_usec = 64}, tnext = {stqe_next = 0x84bc30}, rnext = {stqe_next = 0x858870}, routine = 0, arg = 0x0, tname = 0x505cc0 "do_syncrepl", tspec = 0x857d94 "rid=004"}
(gdb) thr 8
[Switching to thread 8 (process 23265)]#0  0x0000002a968d2540 in strlen () from /lib64/tls/libc.so.6
(gdb) frame 4
#4  0x00000000004957d1 in do_syncrepl (ctx=0x41801e90, arg=0x858a30) at syncrepl.c:1261
1261        Debug( LDAP_DEBUG_TRACE, "=>do_syncrepl %s\n", si->si_ridtxt, 0, 0 );
(gdb) print si
$3 = (syncinfo_t *) 0x20
(gdb) print *rtask
$4 = {next_sched = {tv_sec = 7526470944284832317, tv_usec = 7598542775770181185}, interval = {tv_sec = 8751185004989543539, tv_usec = 3683997482740818493}, tnext = {stqe_next = 0x6974202235203030}, rnext = {stqe_next = 0x333d74756f656d}, routine = 0xc0, arg = 0x20, tname = 0x84bc10 "\220\004", tspec = 0x69666e6f43657361 <Address 0x69666e6f43657361 out of bounds>}
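
A side note on the second rtask dump: the bogus pointer values look like ASCII text when their bytes are read lowest-first, which would fit the task memory having been reused for a string. The helper below is a small sketch for checking that by hand; decode_le_word is a hypothetical name, and it assumes a little-endian machine (which the /lib64 paths suggest, but that is an assumption).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the 8 bytes of `word` into `out`, lowest byte first (the order a
 * little-endian machine stores them in memory). Bytes outside the printable
 * ASCII range are shown as '.'. `out` must have room for 9 chars. */
void decode_le_word(uint64_t word, char out[9])
{
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++) {
        unsigned char c = (unsigned char)(word >> (8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[8] = '\0';
}
```

Fed the values from $4, this gives `aseConfi` for tspec (0x69666e6f43657361), and `00 5" ti` followed by `meout=3.` for tnext and rnext (0x6974202235203030, 0x333d74756f656d), where the trailing '.' stands for a NUL byte. Read together, tnext and rnext spell `00 5" timeout=3`, which looks like the tail of a configuration string rather than two queue pointers.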