ondra@mistotebe.net wrote:
On Thu, Oct 05, 2017 at 03:32:38AM +0000, quanah@openldap.org wrote:
Full_Name: Quanah Gibson-Mount Version: 2.4.45 OS: Linux URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (47.208.148.239)
There is a reproducible lockup with delta-sync based multimaster replication. The more masters exist the easier it is to trigger, with 100% reproducibility at 4 masters.
Generally:
Load a database onto a server with no syncprov/accesslog configuration.
Stop slapd
Reconfigure slapd.conf to include syncprov/accesslog overlays, and cn=accesslog DB for changes, and replication agreements with other servers
Configure other servers for syncprov/accesslog, but no database
start slapd on all servers
execute a MOD operation sequentially across the servers (1, 2, 3, 4)
slapd will deadlock on server 3 or 4 100% of the time
This is still an issue at least in master - running the its8752 regression test often results in the following deadlock:
This says that thread 4 is still holding the LMDB writer mutex while running inside the accesslog overlay. Which it should not be, since back-mdb acquires and commits its LMDB transactions before letting any other module have control.
(gdb) deadlock Cycles: Thread #5 (LWP 8024) in mdb_txn_renew0() at ./../../../libraries/liblmdb/mdb.c:2749 Thread #4 (LWP 8023) in accesslog_op_mod() at accesslog.c:1988 deadlock from root Thread #3 (LWP 8022) in mdb_txn_renew0() at ./../../../libraries/liblmdb/mdb.c:2749
(gdb) thread apply all bt
Thread 9 (Thread 0x7f9199ffd700 (LWP 15035)): #3 0x00007f91b815f20d in ldap_pvt_thread_cond_wait (cond=0xe11cb8, mutex=0xe11c90) at thr_posix.c:288
Thread 8 (Thread 0x7f91a09fa700 (LWP 15034)): #3 0x00007f91b815f20d in ldap_pvt_thread_cond_wait (cond=0xe11cb8, mutex=0xe11c90) at thr_posix.c:288
Thread 7 (Thread 0x7f91b12b7700 (LWP 8026)): #3 0x00007f91b815f20d in ldap_pvt_thread_cond_wait (cond=0xe11cb8, mutex=0xe11c90) at thr_posix.c:288
Thread 6 (Thread 0x7f91b1ab8700 (LWP 8025)): #3 0x00007f91b815f20d in ldap_pvt_thread_cond_wait (cond=0xe11cb8, mutex=0xe11c90) at thr_posix.c:288
Thread 5 (Thread 0x7f91b22b9700 (LWP 8024)): #0 0x00007f91b5dda352 in __pthread_mutex_lock_full [waiting on thread 4 (LWP 8023)] (mutex=0x7f91b8114040) at ../nptl/pthread_mutex_lock.c:313 #1 0x000000000051a881 in mdb_txn_renew0 (txn=0x1027ac0) at ./../../../libraries/liblmdb/mdb.c:2749 #2 0x000000000051afab in mdb_txn_begin (env=0xf24e10, parent=0x0, flags=0, ret=0x7f91b22b7190) at ./../../../libraries/liblmdb/mdb.c:2907 #3 0x00000000005792f4 in mdb_opinfo_get (op=Modify request = {...}, mdb=0x7f91b543b010, rdonly=0, moip=0x7f91b22b7178) at id2entry.c:740 #4 0x000000000052f989 in mdb_modify (op=Modify request = {...}, rs=0x7f91b22b89b8) at modify.c:584 #5 0x00000000004f6d65 in overlay_op_walk (op=Modify request = {...}, rs=0x7f91b22b89b8, which=op_modify, oi=0xe29b10, on=0x0) at backover.c:706 #6 0x00000000004f9d61 in over_op_func (op=Modify request = {...}, rs=0x7f91b22b89b8, which=op_modify) at backover.c:766 #7 0x00000000004f8fb2 in over_op_modify (op=Modify request = {...}, rs=0x7f91b22b89b8) at backover.c:808 #8 0x0000000000472e4e in fe_op_modify (op=Modify request = {...}, rs=0x7f91b22b89b8) at modify.c:307 #9 0x0000000000471bf5 in do_modify (op=Modify request = {...}, rs=0x7f91b22b89b8) at modify.c:177 #10 0x000000000044e1ce in connection_operation (ctx=0x7f91b22b8b78, arg_v=0x7f919c100fa0) at connection.c:1169 #11 0x000000000044c3a1 in connection_read_thread (ctx=0x7f91b22b8b78, argv=0xf) at connection.c:1326
Thread 4 (Thread 0x7f91b2aba700 (LWP 8023)): #2 0x00007f91b815f275 in ldap_pvt_thread_mutex_lock [waiting on thread 5 (LWP 8024)] (mutex=0xe7c328) at thr_posix.c:307 #3 0x00007f91b54fffbc in accesslog_op_mod (op=Modify request = {...}, rs=0x7f91b2ab8bd8) at accesslog.c:1988 #4 0x00000000004f6cc1 in overlay_op_walk (op=Modify request = {...}, rs=0x7f91b2ab8bd8, which=op_modify, oi=0xe29b10, on=0xe7c0c0) at backover.c:691 #5 0x00000000004f9d61 in over_op_func (op=Modify request = {...}, rs=0x7f91b2ab8bd8, which=op_modify) at backover.c:766 #6 0x00000000004f8fb2 in over_op_modify (op=Modify request = {...}, rs=0x7f91b2ab8bd8) at backover.c:808 #7 0x00000000004ec259 in syncrepl_updateCookie (si=0xe7aa90, op=Modify request = {...}, syncCookie=0x7f91b2ab91f0, save=1) at syncrepl.c:4051 #8 0x00000000004e978e in do_syncrep2 (op=Modify request = {...}, si=0xe7aa90) at syncrepl.c:1206 #9 0x00000000004e2556 in do_syncrepl (ctx=0x7f91b2ab9b78, arg=0xe7b030) at syncrepl.c:1604
Thread 3 (Thread 0x7f91b32bb700 (LWP 8022)): #0 0x00007f91b5dda352 in __pthread_mutex_lock_full [waiting on thread 4 (LWP 8023)] (mutex=0x7f91b8114040) at ../nptl/pthread_mutex_lock.c:313 #1 0x000000000051a881 in mdb_txn_renew0 (txn=0x1027ac0) at ./../../../libraries/liblmdb/mdb.c:2749 #2 0x000000000051afab in mdb_txn_begin (env=0xf24e10, parent=0x0, flags=0, ret=0x7f91a4001170) at ./../../../libraries/liblmdb/mdb.c:2907 #3 0x00000000005792f4 in mdb_opinfo_get (op=Search request = {...}, mdb=0x7f91b543b010, rdonly=0, moip=0xe7b348) at id2entry.c:740 #4 0x000000000057989d in mdb_txn (op=Search request = {...}, txnop=1, ptr=0xe7b348) at id2entry.c:812 #5 0x00000000004ee059 in syncrepl_entry (si=0xe7b0f0, op=Search request = {...}, entry="dc=example,dc=com", modlist=0x7f91b32ba1a8, syncstate=1, syncUUID=0x7f91b32ba140, syncCSN=0x0) at syncrepl.c:3087 #6 0x00000000004e8cac in do_syncrep2 (op=Search request = {...}, si=0xe7b0f0) at syncrepl.c:1056 #7 0x00000000004e2556 in do_syncrepl (ctx=0x7f91b32bab78, arg=0xe7b6d0) at syncrepl.c:1604
Thread 2 (Thread 0x7f91b3abc700 (LWP 8021)): #0 0x00007f91b5d0b207 in epoll_wait (epfd=6, events=0xdebc50, maxevents=1024, timeout=7000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x000000000044591a in slapd_daemon_task (ptr=0x1232700) at daemon.c:2745
Thread 1 (Thread 0x7f91b5761780 (LWP 8006)): #1 0x00007f91b815f11d in ldap_pvt_thread_join (thread=140263761364736, thread_return=0x0) at thr_posix.c:208 #2 0x0000000000443a1b in slapd_daemon () at daemon.c:3157 #3 0x000000000041ec5b in main (argc=8, argv=0x7ffcee967dc8) at main.c:1018