Full_Name: Tom Pressnell Version: 2.4.43 OS: Debian 8 URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (149.254.186.170)
Hi,
I have been testing 2.4.43+ITS#8336 as a candidate for production usage. Compiled from source on Debian 8 (jessie) x86_64.
I have been experiencing segmentation faults in syncprov_matchops:
Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000000000562713 in syncprov_matchops (op=0x7ef1e8100c90, opc=0x7ef1b8000cf0, saveit=1) at syncprov.c:1332 1332 op2.ors_filter = ss->s_op->ors_filter->f_and->f_next;
I have been running replication testing, pushing relativly high rates of add/mod operations at a mdb master (NOSYNC) whilst a number of replication clients are connecting and disconnecting (simulating a lossy/faulty network) (killing of replication client scripts / tcpkill).
Looking at ss: $6 = {s_next = 0x7ef1b389f350, s_si = 0xe51cc0, s_base = {bv_len = 13, bv_val = 0x7ef172fb72a0 "dc=xyz,dc=com"}, s_eid = 1, s_op = 0x7ef1b4000aa0, s_rid = 0, s_sid = 0, s_filterstr = {bv_len = 15, bv_val = 0x7ef1b4001248 "(objectClass=*)"}, s_flag= D 17, s_inuse = 1, s_res = 0x7ef172fa71f0, s_restail = 0x7ef172f54450, s_mutex = {__data = {__lock = 1, __count = 0, __owner = 1006, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}C%C __size = "\001\000\000\000\000\000\000\000\356\003\000\000\001", '\000' <repeats 26 times>, __align = 1}}
And at s_op->ors_filter:
(gdb) p *ss->s_op->o_request->oq_search->rs_filter $2 = {f_choice = 161, f_un = {f_un_result = -1275063480, f_un_desc = 0x7ef1b4001348, f_un_ava = 0x7ef1b4001348, f_un_ssa = 0x7ef1b4001348, f_un_mra = 0x7ef1b4001348, f_un_complex = 0x7ef1b4001348}, f_next = 0x0} (gdb) p ss->s_op->o_request->oq_search->rs_filterstr $3 = {bv_len = 23, bv_val = 0x7ef1b40013e0 "(|(cn=4594)(cn=4594:1))"}
This is not the filter used by my syncrepl clients during this test (they all run with objectClass=* as show in ss->s_filterstr), this is one of the filters used by the add/mod script.
Looking at another thread (cutting down output): [Switching to thread 2 (Thread 0x7ef1bffff700 (LWP 2479))] #0 0x00000000004eeb59 in mdb_node_search (mc=0x7ef172ee63f0, key=0x7ef1bfe6d3c0, exactp=0x7ef1bfe6d03c) (gdb) bt #5 0x0000000000553326 in mdb_id2entry (op=0x7ef1b4000aa0, mc=0x7ef172ee63f0, id=26, e%3x7x7ef1bfe7d678) at id2entry.c:153
This thread is working with the same operation ...0aa0, but performing a standard search as i would expect given the filter value. Somehow ss->s_op seems to have ended up pointing at what seems to be an unreleated operation.
Looking at the code i believe the issue could trigger when an op is abandoned early before syncprov_op_search has got hold of the si_ops lock for the psearch sop. I have added a standard o_abandon check and return at line 2574 of syncprov.c while the si_ops lock is held, before sop is added to the list. This seems to have fixed the issue in my testing, i can see this code path is traversed (as i am logging it) a number of times over the last few days of running the tests.
I can provide more detailed backtraces if required. If you would like core dumps this will require extra time as i would have to replicate the test with non company data / schemas.
Thanks
Tom