https://bugs.openldap.org/show_bug.cgi?id=9421
Issue ID: 9421
Summary: SIGSEGV in the MMR synchro
Product: OpenLDAP
Version: 2.4.56
Hardware: x86_64
OS: Linux
Status: UNCONFIRMED
Severity: normal
Priority: ---
Component: overlays
Assignee: bugs@openldap.org
Reporter: benjamin.demarteau@liege.be
Target Milestone: ---
We are in the process of migrating from a single outdated node to an up-to-date MMR cluster. Through this process we set up LSC synchronizations from the old server to the new server so we can keep the old server around. Our preliminary tests show that when LSC hammers the LDAP server using multiple threads while another node is included in the replication, we get segmentation faults with the following backtrace:
#0 0x00007f7f578748ef in __strncasecmp_l_avx () from /lib64/libc.so.6
#1 0x000056094a7ca298 in avl_find (root=0x56094bb28820, data=data@entry=0x7f7e74000cd0, fcmp=fcmp@entry=0x56094a7166a0 <oc_index_name_cmp>) at avl.c:545
#2 0x000056094a716bde in oc_bvfind (ocname=0x7f7e74000cd0) at oc.c:186
#3 oc_bvfind (ocname=ocname@entry=0x7f7e74000cd0) at oc.c:178
#4 0x000056094a70ec5a in objectSubClassMatch (matchp=0x7f7e5fff8c8c, flags=256, syntax=<optimized out>, mr=<optimized out>, value=<optimized out>, assertedValue=0x7f7e74000cd0) at schema_prep.c:214
#5 0x000056094a6e9fb9 in ordered_value_match (match=match@entry=0x7f7e5fff8c8c, ad=0x56094bb184e0, mr=mr@entry=0x56094bb09810, flags=flags@entry=256, v1=v1@entry=0x7f7e5810f470, v2=v2@entry=0x7f7e74000cd0, text=0x7f7e5fff8c90) at value.c:693
#6 0x000056094a6ec44d in test_ava_filter (op=op@entry=0x7f7e5fff90c0, e=e@entry=0x56094bb54a88, ava=0x7f7e74000cc8, type=type@entry=163) at filterentry.c:777
#7 0x000056094a6ecfec in test_filter (op=op@entry=0x7f7e5fff90c0, e=e@entry=0x56094bb54a88, f=f@entry=0x7f7e74000d08) at filterentry.c:88
#8 0x000056094a6ecc81 in test_filter_and (flist=<optimized out>, e=0x56094bb54a88, op=0x7f7e5fff90c0) at filterentry.c:879
#9 test_filter (op=op@entry=0x7f7e5fff90c0, e=0x56094bb54a88, f=<optimized out>) at filterentry.c:118
#10 0x00007f7f5382c58f in syncprov_matchops (op=op@entry=0x7f7e5fff9c80, opc=opc@entry=0x7f7e58001808, saveit=saveit@entry=0) at syncprov.c:1393
#11 0x00007f7f5382e37f in syncprov_op_response (op=0x7f7e5fff9c80, rs=<optimized out>) at syncprov.c:2115
#12 0x000056094a6dcb98 in slap_response_play (op=op@entry=0x7f7e5fff9c80, rs=rs@entry=0x7f7e5fff9c10) at result.c:508
#13 0x000056094a6dd11c in send_ldap_response (op=op@entry=0x7f7e5fff9c80, rs=rs@entry=0x7f7e5fff9c10) at result.c:583
#14 0x000056094a6ddd43 in slap_send_ldap_result (op=0x7f7e5fff9c80, rs=0x7f7e5fff9c10) at result.c:861
#15 0x000056094a7a86fd in mdb_add (op=0x7f7e5fff9c80, rs=0x7f7e5fff9c10) at add.c:435
#16 0x000056094a73cd78 in overlay_op_walk (op=op@entry=0x7f7e5fff9c80, rs=0x7f7e5fff9c10, which=op_add, oi=0x56094bb8a720, on=<optimized out>) at backover.c:677
#17 0x000056094a73ceab in over_op_func (op=0x7f7e5fff9c80, rs=<optimized out>, which=<optimized out>) at backover.c:730
#18 0x00007f7f5361ff6a in accesslog_response (op=<optimized out>, rs=<optimized out>) at accesslog.c:1877
#19 0x000056094a6dcb98 in slap_response_play (op=op@entry=0x7f7e7410fff0, rs=rs@entry=0x7f7e5fffa870) at result.c:508
#20 0x000056094a6dd11c in send_ldap_response (op=op@entry=0x7f7e7410fff0, rs=rs@entry=0x7f7e5fffa870) at result.c:583
#21 0x000056094a6ddd43 in slap_send_ldap_result (op=0x7f7e7410fff0, rs=0x7f7e5fffa870) at result.c:861
#22 0x000056094a7a86fd in mdb_add (op=0x7f7e7410fff0, rs=0x7f7e5fffa870) at add.c:435
#23 0x000056094a73cd78 in overlay_op_walk (op=op@entry=0x7f7e7410fff0, rs=0x7f7e5fffa870, which=op_add, oi=0x56094bb8a900, on=<optimized out>) at backover.c:677
#24 0x000056094a73ceab in over_op_func (op=0x7f7e7410fff0, rs=<optimized out>, which=<optimized out>) at backover.c:730
#25 0x000056094a6d32bd in fe_op_add (op=0x7f7e7410fff0, rs=0x7f7e5fffa870) at add.c:334
#26 0x000056094a6d4139 in do_add (op=0x7f7e7410fff0, rs=0x7f7e5fffa870) at add.c:194
#27 0x000056094a6cbfc0 in connection_operation (ctx=ctx@entry=0x7f7e5fffaab0, arg_v=arg_v@entry=0x7f7e7410fff0) at connection.c:1175
#28 0x000056094a6ccdbe in connection_read_thread (ctx=0x7f7e5fffaab0, argv=0x1a) at connection.c:1311
#29 0x00007f7f5903bead in ldap_int_thread_pool_wrapper (xpool=0x56094bb2a1d0) at tpool.c:696
#30 0x00007f7f57ae414a in start_thread () from /lib64/libpthread.so.0
#31 0x00007f7f57815f23 in clone () from /lib64/libc.so.6
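For readers reproducing this, a minimal gdb sketch for pulling such a backtrace out of a core file; the binary and core paths are assumptions about a typical install:

    gdb /usr/sbin/slapd /path/to/core
    (gdb) bt                     # crashing thread only
    (gdb) bt full                # same, including local variables
    (gdb) thread apply all bt    # backtraces of every thread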
If we take down the second node, we cannot reproduce the segfaults anymore.
Let me know if we can provide more information (we can't provide the core dump since it's full of passwords).
--- Comment #1 from Quanah Gibson-Mount quanah@openldap.org ---
Can you also provide the stats + sync log output prior to this happening? Are you using standard syncrepl or delta-syncrepl?
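For reference, those log levels can be enabled at runtime through cn=config; a minimal sketch, assuming local root access over ldapi:///:

    ldapmodify -Y EXTERNAL -H ldapi:/// <<'EOF'
    dn: cn=config
    changetype: modify
    replace: olcLogLevel
    olcLogLevel: stats sync
    EOF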
--- Comment #2 from Ondřej Kuzník ondra@mistotebe.net ---
On Tue, Dec 08, 2020 at 11:39:37AM +0000, openldap-its@openldap.org wrote:
> We are in the process of migrating from a single outdated node to an
> up-to-date MMR cluster. Through this process we set up LSC synchronizations
> from the old server to the new server so we can keep the old server around.
> Our preliminary tests show that when LSC hammers the LDAP server using
> multiple threads while another node is included in the replication, we get
> segmentation faults with the following backtrace:
> #0 0x00007f7f578748ef in __strncasecmp_l_avx () from /lib64/libc.so.6
> #1 0x000056094a7ca298 in avl_find (root=0x56094bb28820, data=data@entry=0x7f7e74000cd0, fcmp=fcmp@entry=0x56094a7166a0 <oc_index_name_cmp>) at avl.c:545
> #2 0x000056094a716bde in oc_bvfind (ocname=0x7f7e74000cd0) at oc.c:186
> #3 oc_bvfind (ocname=ocname@entry=0x7f7e74000cd0) at oc.c:178
> #4 0x000056094a70ec5a in objectSubClassMatch (matchp=0x7f7e5fff8c8c, flags=256, syntax=<optimized out>, mr=<optimized out>, value=<optimized out>, assertedValue=0x7f7e74000cd0) at schema_prep.c:214
> #5 0x000056094a6e9fb9 in ordered_value_match (match=match@entry=0x7f7e5fff8c8c, ad=0x56094bb184e0, mr=mr@entry=0x56094bb09810, flags=flags@entry=256, v1=v1@entry=0x7f7e5810f470, v2=v2@entry=0x7f7e74000cd0, text=0x7f7e5fff8c90) at value.c:693
> #6 0x000056094a6ec44d in test_ava_filter (op=op@entry=0x7f7e5fff90c0, e=e@entry=0x56094bb54a88, ava=0x7f7e74000cc8, type=type@entry=163) at filterentry.c:777
> #7 0x000056094a6ecfec in test_filter (op=op@entry=0x7f7e5fff90c0, e=e@entry=0x56094bb54a88, f=f@entry=0x7f7e74000d08) at filterentry.c:88
> #8 0x000056094a6ecc81 in test_filter_and (flist=<optimized out>, e=0x56094bb54a88, op=0x7f7e5fff90c0) at filterentry.c:879
> #9 test_filter (op=op@entry=0x7f7e5fff90c0, e=0x56094bb54a88, f=<optimized out>) at filterentry.c:118
> #10 0x00007f7f5382c58f in syncprov_matchops (op=op@entry=0x7f7e5fff9c80, opc=opc@entry=0x7f7e58001808, saveit=saveit@entry=0) at syncprov.c:1393
Can you print out the full backtrace (bt full) at least up until this point and as much of the entry as you can (masking out any private data), but definitely the objectclass name in frame 2 above?
Relevant syncrepl and syncprov config might also be helpful.
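A minimal sketch of how that configuration could be pulled from cn=config, assuming root access over ldapi:///:

    ldapsearch -LLL -Y EXTERNAL -H ldapi:/// -b cn=config \
        '(|(objectClass=olcSyncProvConfig)(olcSyncrepl=*))' \
        olcOverlay olcSpNoPresent olcSpReloadHint olcSyncrepl olcMirrorMode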
> If we take down the second node, we cannot reproduce the segfaults anymore.
>
> Let me know if we can provide more information (we can't provide the core
> dump since it's full of passwords).
Thanks,
--- Comment #3 from benjamin.demarteau@liege.be ---
> Can you also provide the stats + sync log output prior to this happening?
> Are you using standard syncrepl or delta-syncrepl?
I'll try to get more logs soon; see below for the syncrepl config.
> Can you print out the full backtrace (bt full) at least up until this
> point and as much of the entry as you can (masking out any private data),
> but definitely the objectclass name in frame 2 above?
I didn't see anything private, please let me know if something slipped.
#0  0x00007fda6c4168ef in __strncasecmp_l_avx () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000563ea88e1298 in avl_find (root=0x563eaa404820, data=data@entry=0x7fd98c000cd0, fcmp=fcmp@entry=0x563ea882d6a0 <oc_index_name_cmp>) at avl.c:545
        cmp = <optimized out>
#2  0x0000563ea882dbde in oc_bvfind (ocname=0x7fd98c000cd0) at oc.c:186
        oir = <optimized out>
        oir = <optimized out>
#3  oc_bvfind (ocname=ocname@entry=0x7fd98c000cd0) at oc.c:178
        oir = <optimized out>
#4  0x0000563ea8825c5a in objectSubClassMatch (matchp=0x7fd958cf8f9c, flags=256, syntax=<optimized out>, mr=<optimized out>, value=<optimized out>, assertedValue=0x7fd98c000cd0) at schema_prep.c:214
        a = 0x7fd98c000cd0
        oc = 0x563eaa45c7d0
        asserted = <optimized out>
#5  0x0000563ea8800fb9 in ordered_value_match (match=match@entry=0x7fd958cf8f9c, ad=0x563eaa3f44e0, mr=mr@entry=0x563eaa3e5810, flags=flags@entry=256, v1=v1@entry=0x7fd95011dd10, v2=v2@entry=0x7fd98c000cd0, text=0x7fd958cf8fa0) at value.c:693
        bv1 = {bv_len = 0, bv_val = 0x0}
        bv2 = {bv_len = 94827115917557, bv_val = 0x7fd900000000 <error: Cannot access memory at address 0x7fd900000000>}
#6  0x0000563ea880344d in test_ava_filter (op=op@entry=0x7fd958cf93d0, e=e@entry=0x563eaa430998, ava=0x7fd98c000cc8, type=type@entry=163) at filterentry.c:777
        ret = <optimized out>
        match = 32729
        text = 0x7fd958cf8fe8 "\354?\200\250>V"
        use = 256
        mr = 0x563eaa3e5810
        bv = 0x7fd95011dd10
        rc = 5
        a = 0x563eaa44c2c0
        __PRETTY_FUNCTION__ = "test_ava_filter"
#7  0x0000563ea8803fec in test_filter (op=op@entry=0x7fd958cf93d0, e=e@entry=0x563eaa430998, f=f@entry=0x7fd98c000d08) at filterentry.c:88
        rc = <optimized out>
#8  0x0000563ea8803c81 in test_filter_and (flist=<optimized out>, e=0x563eaa430998, op=0x7fd958cf93d0) at filterentry.c:879
        rc = <optimized out>
        f = 0x7fd98c000d08
        rtn = 6
        f = <optimized out>
        rtn = <optimized out>
        rc = <optimized out>
#9  test_filter (op=op@entry=0x7fd958cf93d0, e=0x563eaa430998, f=<optimized out>) at filterentry.c:118
        rc = <optimized out>
#10 0x00007fda683ce58f in syncprov_matchops (op=op@entry=0x7fd958cf9f90, opc=opc@entry=0x7fd950121ad0, saveit=saveit@entry=0) at syncprov.c:1393
        op2 = {o_hdr = 0x7fd958cf9280, o_tag = 99, o_time = 1607446933, o_tincr = 39, o_bd = 0x563eaa4998d0, o_req_dn = {bv_len = 12, bv_val = 0x7fd98c000c80 "cn=acces\030"}, o_req_ndn = {bv_len = 12, bv_val = 0x7fd98c000cb0 "cn=acces\030"}, o_request = {oq_add = {rs_modlist = 0x2, rs_e = 0xffffffffffffffff}, oq_bind = {rb_method = 2, rb_cred = {bv_len = 18446744073709551615, bv_val = 0x563eaa49dfb8 "\377\377\377\377"}, rb_edn = {bv_len = 0, bv_val = 0x7fd98c000e28 "\005"}, rb_ssf = 2348813688, rb_mech = {bv_len = 101, bv_val = 0x7fd98c001308 "(&(entryCSN>=20201201110619.762262Z#000000#003#000000)(&(objectClass=auditWriteObject)(reqResult=0)))"}}, oq_compare = {rs_ava = 0x2}, oq_modify = {rs_mods = { rs_modlist = 0x2, rs_no_opattrs = -1 '\377'}, rs_increment = -1437999176}, oq_modrdn = {rs_mods = {rs_modlist = 0x2, rs_no_opattrs = -1 '\377'}, rs_deleteoldrdn = -1437999176, rs_newrdn = { bv_len = 0, bv_val = 0x7fd98c000e28 "\005"}, rs_nnewrdn = {bv_len = 140572333444472, bv_val = 0x65 <error: Cannot access memory at address 0x65>}, rs_newSup = 0x7fd98c001308, rs_nnewSup = 0x0}, oq_search = {rs_scope = 2, rs_deref = 0, rs_slimit = -1, rs_tlimit = -1, rs_limit = 0x563eaa49dfb8, rs_attrsonly = 0, rs_attrs = 0x7fd98c000e28, rs_filter = 0x7fd98c000d78, rs_filterstr = { bv_len = 101, bv_val = 0x7fd98c001308 "(&(entryCSN>=20201201110619.762262Z#000000#003#000000)(&(objectClass=auditWriteObject)(reqResult=0)))"}}, oq_abandon = {rs_msgid = 2}, oq_cancel = { rs_msgid = 2}, oq_extended = {rs_reqoid = {bv_len = 2, bv_val = 0xffffffffffffffff <error: Cannot access memory at address 0xffffffffffffffff>}, rs_flags = -1437999176, rs_reqdata = 0x0}, oq_pwdexop = {rs_extended = {rs_reqoid = {bv_len = 2, bv_val = 0xffffffffffffffff <error: Cannot access memory at address 0xffffffffffffffff>}, rs_flags = -1437999176, rs_reqdata = 0x0}, rs_old = { bv_len = 140572333444648, bv_val = 0x7fd98c000d78 "\240"}, rs_new = {bv_len = 101, bv_val = 0x7fd98c001308 "(&(entryCSN>=20201201110619.762262Z#000000#003#000000)(&(objectClass=auditWriteObject)(reqResult=0)))"}, rs_mods = 0x0, rs_modtail = 0x0}}, o_abandon = 0, o_cancel = 0, o_groups = 0x0, o_do_not_cache = 0 '\000', o_is_auth_check = 0 '\000', o_dont_replicate = 0 '\000', o_acl_priv = ACL_NONE, o_nocaching = 0 '\000', o_delete_glue_parent = 0 '\000', o_no_schema_check = 0 '\000', o_no_subordinate_glue = 0 '\000', o_ctrlflag = '\000' <repeats 14 times>, "\003\000\000\062", '\000' <repeats 13 times>, o_controls = 0x7fd99014f670, o_authz = { sai_method = 128, sai_mech = {bv_len = 0, bv_val = 0x0}, sai_dn = {bv_len = 51, bv_val = 0x7fd95011ec40 "cn=replicator,ou=technical,ou=people,dc=liege,dc=be"}, sai_ndn = {bv_len = 51, bv_val = 0x7fd98c1680a0 "cn=replicator,ou=technical,ou=people,dc=liege,dc=be"}, sai_ssf = 256, sai_transport_ssf = 0, sai_tls_ssf = 256, sai_sasl_ssf = 0}, o_ber = 0x7fd950115250, o_res_ber = 0x0, o_callback = 0x0, o_ctrls = 0x7fd98c001198, o_csn = {bv_len = 0, bv_val = 0x0}, o_private = 0x0, o_extra = {slh_first = 0x0}, o_next = {stqe_next = 0x0}}
        sm = <optimized out>
        ss = 0x7fd98c18fd70
        oh = {oh_opid = 13, oh_connid = 336443, oh_conn = 0x563eaa504830, oh_msgid = 14, oh_protocol = 3, oh_tid = 140571474638592, oh_threadctx = 0x7fd958cfaab0, oh_tmpmemctx = 0x7fd95011e5e0, oh_tmpmfuncs = 0x563ea8b69fa0 <slap_sl_mfuncs>, oh_counters = 0x7fd9501204e0, oh_log_prefix = "conn=336411 op=13", '\000' <repeats 238 times>, oh_extensions = 0x0}
        found = 0
        snext = <optimized out>
        on = 0x563eaa49b280
        si = 0x563eaa49b460
        fc = {fdn = 0x7fd958cf9fc8, fss = 0x7fd98c18fd70, fbase = 1, fscope = 1}
        pss = 0x563eaa49b460
        e = 0x563eaa430998
        a = <optimized out>
        rc = <optimized out>
        gonext = 1
        newdn = {bv_len = 140571474629160, bv_val = 0x7fd958cf9400 "\200\f"}
        freefdn = <optimized out>
        b0 = 0x7fd958cf9cd0
        db = {bd_info = 0x0, bd_self = 0x563eaa6e5fc0, be_ctrls = "\341!\000\000\000\000\000\000\364\004\210\250>V\000\000\344\227\317X\331\177\000\000P\230\317X\331\177\000", <incomplete sequence \340>, be_flags = 15, be_restrictops = 140571474630304, be_requires = 18005031565085346560, be_ssf_set = {sss_ssf = 1489999952, sss_transport = 32729, sss_tls = 2827488567, sss_sasl = 22078, sss_update_ssf = 1489999968, sss_update_transport = 32729, sss_update_tls = 2827492915, sss_update_sasl = 22078, sss_simple_bind = 1489999296}, be_suffix = 0x563e00010000, be_nsuffix = 0x8, be_schemadn = {bv_len = 140569985170232, bv_val = 0x0}, be_schemandn = {bv_len = 18005031565085346560, bv_val = 0x0}, be_rootdn = {bv_len = 140571474630736, bv_val = 0x1958cf97a0 <error: Cannot access memory at address 0x1958cf97a0>}, be_rootndn = {bv_len = 34359738399, bv_val = 0xe <error: Cannot access memory at address 0xe>}, be_rootpw = { bv_len = 4294936498, bv_val = 0x0}, be_max_deref_depth = 2827506529, be_def_limit = {lms_t_soft = 22078, lms_t_hard = -1435607104, lms_s_soft = 22078, lms_s_hard = 0, lms_s_unchecked = 0, lms_s_pr = 0, lms_s_pr_hide = 32729, lms_s_pr_total = 0}, be_limits = 0x0, be_acl = 0x7, be_dfltaccess = -1743705582, be_extra_anlist = 0x563ea887d3ea <mdb_cmp_memn+42>, be_update_ndn = { bv_len = 140571327972704, bv_val = 0x7fd958cf96c0 "\276\227\317X\331\177"}, be_update_refs = 0xf0000, be_pending_csn_list = 0xffffffffffffffff, be_pcl_mutex = {__data = {__lock = 1489999504, __count = 32729, __owner = 1343363729, __nusers = 32729, __kind = 1343363726, __spins = 32729, __elision = 0, __list = {__prev = 0x7fd958cf96f0, __next = 0x580f975c}}, __size = "\220\226\317X\331\177\000\000\221\032\022P\331\177\000\000\216\032\022P\331\177\000\000\360\226\317X\331\177\000\000\\227\017X\000\000\000", __align = 140571474630288}, be_syncinfo = 0xffffffffffffffff, be_pb = 0x7fd958cf97a0, be_cf_ocs = 0x7fd958cf97c1, be_private = 0x7fd958cf97be, be_next = {stqe_next = 0x7fd958cf97b7}}
#11 0x00007fda683d037f in syncprov_op_response (op=0x7fd958cf9f90, rs=<optimized out>) at syncprov.c:2115
        maxcsn = {bv_len = 40, bv_val = 0x7fd958cf97a0 "20201208170213.107305Z#000000#002#000000"}
        cbuf = "20201208170213.107305Z#000000#002#000000\000\213\233Nи\336\371\000\000\000@\000\000\000\000\000\020\000\000\b\000\001"
        do_check = <optimized out>
        foundit = 1
        csn_changed = <optimized out>
        have_psearches = 1
        opc = <optimized out>
        on = <optimized out>
        si = <optimized out>
        sm = <optimized out>
> Relevant syncrepl and syncprov config might also be helpful.
This is all I could find related to the MMR setup in our installation playbooks:
- index: 1
  name: syncprov
  objectClass: olcSyncProvConfig

- index: 2
  name: accesslog
  objectClass: olcAccessLogConfig
  attributes:
    olcAccessLogDB: cn=accesslog
    olcAccessLogOps: writes
    olcAccessLogSuccess: "TRUE"
    olcAccessLogPurge: 07+00:00 01+00:00  # remove stuff older than 7 days, every 1 day

- dn: "olcOverlay={0}syncprov,olcDatabase={2}mdb,cn=config"
  objectClass:
    - olcOverlayConfig
    - olcSyncProvConfig
  attributes:
    olcSpNoPresent: "TRUE"
    olcSpReloadHint: "TRUE"

- olcSyncrepl:
    index: "{{ item.0.index }}"
    values: >-
      {{ '{' ~ server_index + '}' }}rid={{ rid }}
      provider="ldaps://{{ hostvars[item.1].inventory_hostname }}"
      binddn="{{ item.0.replication_binddn }}"
      bindmethod=simple
      credentials="{{ item.0.replication_bindpw }}"
      searchbase="{{ item.0.olcSuffix }}"
      logbase="cn=accesslog"
      logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
      schemachecking=on
      syncdata=accesslog
      type=refreshAndPersist
      retry="5 5 300 +"
      timeout=1
      tls_cacert={{ ca_cert_directory }}/ca.pem
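For readers unfamiliar with the templating above, a sketch of roughly how one rendered olcSyncrepl value would look; the host name, DNs, credentials and path below are placeholders, not the reporter's values:

    olcSyncrepl: {1}rid=001
     provider="ldaps://ldap2.example.com"
     binddn="cn=replicator,dc=example,dc=com" bindmethod=simple credentials="secret"
     searchbase="dc=example,dc=com"
     logbase="cn=accesslog"
     logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
     schemachecking=on syncdata=accesslog type=refreshAndPersist
     retry="5 5 300 +" timeout=1
     tls_cacert=/etc/openldap/certs/ca.pem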
--- Comment #4 from Quanah Gibson-Mount quanah@openldap.org ---
Thanks. After reviewing the stack trace, in frame 10, can you please print *ss?
Thanks!
--- Comment #5 from benjamin.demarteau@liege.be ---
> Thanks. After reviewing the stack trace, in frame 10, can you please print *ss?
(gdb) f 10
#10 0x00007fda683ce58f in syncprov_matchops (op=op@entry=0x7fd958cf9f90, opc=opc@entry=0x7fd950121ad0, saveit=saveit@entry=0) at syncprov.c:1393
1393    syncprov.c: No such file or directory.
(gdb) print *ss
$1 = {s_next = 0x0, s_si = 0x563eaa49b460, s_base = {bv_len = 12, bv_val = 0x7fd98c104c50 "cn=accesslog"}, s_eid = 1, s_op = 0x7fd99014f3b0, s_rid = 0, s_sid = 3, s_filterstr = {bv_len = 46, bv_val = 0x7fd98c000d98 "(&(objectClass=auditWriteObject)(reqResult=0))"}, s_flags = 17, s_inuse = 1, s_res = 0x0, s_restail = 0x0, s_mutex = {__data = {__lock = 1, __count = 0, __owner = 37086, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001\000\000\000\000\000\000\000ސ\000\000\001", '\000' <repeats 26 times>, __align = 1}}
Quanah Gibson-Mount quanah@openldap.org changed:
What             |Removed |Added
----------------------------------------------------------------------------
Target Milestone |---     |2.5.1
Keywords         |        |OL_2_5_REQ, replication
Quanah Gibson-Mount quanah@openldap.org changed:
What     |Removed           |Added
----------------------------------------------------------------------------
Keywords |                  |reviewed
Severity |normal            |blocker
Assignee |bugs@openldap.org |hyc@openldap.org
--- Comment #6 from Ondřej Kuzník ondra@mistotebe.net ---
On Wed, Dec 09, 2020 at 09:09:09AM +0000, openldap-its@openldap.org wrote:
> Thanks. After reviewing the stack trace, in frame 10, can you please print *ss?
>
> (gdb) f 10
> #10 0x00007fda683ce58f in syncprov_matchops (op=op@entry=0x7fd958cf9f90, opc=opc@entry=0x7fd950121ad0, saveit=saveit@entry=0) at syncprov.c:1393
> 1393    syncprov.c: No such file or directory.
> (gdb) print *ss
> $1 = {s_next = 0x0, s_si = 0x563eaa49b460, s_base = {bv_len = 12, bv_val = 0x7fd98c104c50 "cn=accesslog"}, s_eid = 1, s_op = 0x7fd99014f3b0, s_rid = 0, s_sid = 3, s_filterstr = {bv_len = 46, bv_val = 0x7fd98c000d98 "(&(objectClass=auditWriteObject)(reqResult=0))"}, s_flags = 17, s_inuse = 1, s_res = 0x0, s_restail = 0x0, s_mutex = {__data = {__lock = 1, __count = 0, __owner = 37086, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001\000\000\000\000\000\000\000ސ\000\000\001", '\000' <repeats 26 times>, __align = 1}}
Looking at op2, the o_req_dn and o_req_ndn values look suspect and it seems some memory has been overwritten. Can you reproduce it with a non-optimised build and get the output of "thread apply all bt", removing any private data?
Or even better, if it happens often enough in your environment, does running under valgrind sound like a feasible undertaking?
Thanks, Ondrej
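A minimal sketch of such a non-optimised build and trace collection; the configure flags, install prefix and core path are assumptions about a typical source build:

    CFLAGS="-g -O0" ./configure --prefix=/opt/openldap-debug --enable-overlays --enable-modules
    make depend && make && make install
    # start the debug slapd with the same configuration, then on the next crash:
    gdb /opt/openldap-debug/libexec/slapd /path/to/core
    (gdb) thread apply all bt full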
--- Comment #7 from benjamin.demarteau@liege.be ---
Created attachment 789
  --> https://bugs.openldap.org/attachment.cgi?id=789&action=edit
2021-02-05-slapd 2.4.56 (Nov 20 2020 230143)-valgrind.log
Ran it with:
valgrind --verbose --leak-check=full --time-stamp=yes --trace-children=yes --log-file=valgrind.log /usr/sbin/slapd -u ldap -h "ldap:/// ldaps:/// ldapi:///" -d 0
Here is the log.
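If the run can be repeated, a variant of the same invocation with origin tracking (assuming the extra overhead is acceptable) usually points at where an overwritten value was first produced:

    valgrind --verbose --leak-check=full --track-origins=yes --num-callers=30 --time-stamp=yes --log-file=valgrind.log /usr/sbin/slapd -u ldap -h "ldap:/// ldaps:/// ldapi:///" -d 0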
--- Comment #8 from Ondřej Kuzník ondra@mistotebe.net ---
That is strange. I guess getting information from a non-optimised build might be more useful, especially a full backtrace plus the detailed contents of the entry (argument 'e') and the assertion ('ava') in frame 6, to see which one contains the NULL pointer and hopefully figure out how it got there...
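A sketch of the gdb commands that would show that, using field names from the slapd sources (Entry.e_name, AttributeAssertion.aa_desc/aa_value) and assuming a build with debug symbols:

    (gdb) frame 6
    (gdb) print *e                        # the Entry being tested against the filter
    (gdb) print e->e_name                 # its DN (a struct berval)
    (gdb) print *ava                      # the AttributeAssertion from the filter
    (gdb) print ava->aa_desc->ad_cname    # asserted attribute type
    (gdb) print ava->aa_value             # asserted value (the suspect berval)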
benjamin.demarteau@liege.be changed:
What                        |Removed |Added
----------------------------------------------------------------------------
Attachment #789 is obsolete |0       |1
--- Comment #9 from benjamin.demarteau@liege.be ---
Created attachment 790
  --> https://bugs.openldap.org/attachment.cgi?id=790&action=edit
valgrind log for 2.5.57
I have updated to slapd 2.4.57 (Jan 18 2021 21:29:08); here is the valgrind output for it.
benjamin.demarteau@liege.be changed:
What                        |Removed                   |Added
----------------------------------------------------------------------------
Attachment #790 description |valgrind log for 2.5.57   |valgrind log for 2.4.57
Attachment #790 filename    |slapd-2.5.57-valgrind.log |slapd-2.4.57-valgrind.log
--- Comment #10 from benjamin.demarteau@liege.be ---
I'll try to compile the latest version locally and see if I can reproduce the problem.
benjamin.demarteau@liege.be changed:
What    |Removed |Added
----------------------------------------------------------------------------
Version |2.4.56  |2.4.57
Quanah Gibson-Mount quanah@openldap.org changed:
What             |Removed |Added
----------------------------------------------------------------------------
Target Milestone |2.5.1   |2.5.3
--- Comment #11 from benjamin.demarteau@liege.be ---
OK, in trying to reproduce this problem with our own build, we found several problems that caused the crash:
* 2 servers were included in the replication config that were never provisioned
* the first server had SID=0
* we were trying to replicate two DBs using a single accesslog db
After we fixed this by:
* removing the 2 servers from the config
* starting SIDs at 1 instead of 0
* changing the logfilter value:
- logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
+ logfilter="(&(objectClass=auditWriteObject)(reqResult=0)(reqDn:dnSubtreeMatch:={{ item.0.olcSuffix }}))"
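Rendered for a hypothetical suffix dc=example,dc=com, the new filter looks like:

    logfilter="(&(objectClass=auditWriteObject)(reqResult=0)(reqDN:dnSubtreeMatch:=dc=example,dc=com))"

The extra reqDN clause scopes each consumer to the accesslog records of its own suffix, which is what allows two replicated databases to share one accesslog database.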
Everything started working just fine and we can't reproduce the crash anymore.
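For reference, a minimal cn=config sketch of non-zero server IDs for a two-provider setup; the host names are placeholders:

    dn: cn=config
    changetype: modify
    replace: olcServerID
    olcServerID: 1 ldap://ldap1.example.com
    olcServerID: 2 ldap://ldap2.example.com

Each server selects the olcServerID value whose URL matches one of its own listeners, so every provider ends up with a distinct non-zero SID.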
There are a lot of invariants that the code expects but that aren't validated or documented (or the documentation is hard to find); sorry for wasting your time chasing this unicorn.
Howard Chu hyc@openldap.org changed:
What           |Removed     |Added
----------------------------------------------------------------------------
Component      |overlays    |documentation
Ever confirmed |0           |1
Status         |UNCONFIRMED |CONFIRMED
--- Comment #12 from Howard Chu hyc@openldap.org ---
(In reply to benjamin.demarteau from comment #11)
> OK, in trying to reproduce this problem with our own build, we found
> several problems that caused the crash:
> * 2 servers were included in the replication config that were never
>   provisioned
> * the first server had SID=0
> * we were trying to replicate two DBs using a single accesslog db
>
> After we fixed this by:
> * removing the 2 servers from the config
> * starting SIDs at 1 instead of 0
> * changing the logfilter value:
>
> - logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
> + logfilter="(&(objectClass=auditWriteObject)(reqResult=0)(reqDn:dnSubtreeMatch:={{ item.0.olcSuffix }}))"
>
> Everything started working just fine and we can't reproduce the crash
> anymore.
>
> There are a lot of invariants that the code expects but that aren't
> validated or documented (or the documentation is hard to find); sorry for
> wasting your time chasing this unicorn.
Will keep this ticket open as a doc bug. Can you tell us exactly which pieces of information you had trouble finding?
Quanah Gibson-Mount quanah@openldap.org changed:
What             |Removed                           |Added
----------------------------------------------------------------------------
Target Milestone |2.5.3                             |---
Resolution       |---                               |FEEDBACK
Status           |CONFIRMED                         |RESOLVED
Keywords         |OL_2_5_REQ, replication, reviewed |
Quanah Gibson-Mount quanah@openldap.org changed:
What   |Removed  |Added
----------------------------------------------------------------------------
Status |RESOLVED |VERIFIED