Full_Name: Tracy Stenvik Version: 2.4.16 (+db-4.7.25pl4) OS: Red Hat Enterprise Linux Server release 5.3 (Tikanga) URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (140.142.111.41)
In evaluating 2.4.16 (to upgrade our 2.3.39-based directory service) we have encountered some serious issues. A similated production-level load that 2.3.39 (with slurpd running) handles easily causes 2.4.16 (and 2.4.15) to crash within 5 minutes to 24 hours.
Without replication active, 2.4.16 survived 24 hours under load. As soon as replication was activated however, the server crashed within 5 minutes. We've tried several syncrepl variants (as described below) but none appear stable.
In analyzing the various crash dumps, most of the trouble appears to lie in the caching code. It has a nasty habit of returing nulls when nulls are not expected.
Our simulated load consists of an arbitrary mix of adds/modifies/searches to various objects in the directory. This load is defined as "N" instances of the "Stressor" test, as discussed on the "GDS Stress Test Findings" page at,
http://wiki.cac.washington.edu/display/infra/GDS+Stress+Test+Findings
Following, you will find a series of crash dumps, along with a few comments. Also included are sections the various slapd.confs we were testing with.
Please contact me if you require additional information.
delta syncrepl crash -------------------- (gdb) where #0 0xb7b7abab in strncmp () from /lib/libc.so.6 #1 0x08143ea4 in bdb_rdn_cmp (v_e1=0xf069be8, v_e2=0xf73cb48) at cache.c:301 #2 0x081d00ee in avl_insert (root=0x8622088, data=0xf069be8, fcmp=0x8143e55 <bdb_rdn_cmp>, fdup=0x81d10a0 <avl_dup_error>) at avl.c:95 #3 0x08144692 in hdb_cache_find_parent (op=0xf95ea78, txn=0xe7def98, id=343637, res=0x14b3ade0) at cache.c:593 #4 0x08144dc9 in hdb_cache_find_id (op=0xf95ea78, tid=0xe7def98, id=343637, eip=0x14b3ade0, flag=0, lock=0x14b3adc4) at cache.c:905 #5 0x0810b6a0 in hdb_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:705 #6 0x080e9b02 in overlay_op_walk (op=0xf95ea78, rs=0x14bfc0e8, which=op_search, oi=0x85aeed8, on=0x0) at backover.c:669 #7 0x080e9cb7 in over_op_func (op=0xf95ea78, rs=0x14bfc0e8, which=op_search) at backover.c:721 #8 0x080e9d3b in over_op_search (op=0xf95ea78, rs=0x14bfc0e8) at backover.c:743 #9 0x0806f0b3 in fe_op_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:366 #10 0x0806ea53 in do_search (op=0xf95ea78, rs=0x14bfc0e8) at search.c:217 #11 0x0806b959 in connection_operation (ctx=0x14bfc1d0, arg_v=0xf95ea78) at connection.c:1097 #12 0x0806be33 in connection_read_thread (ctx=0x14bfc1d0, argv=0x23) at connection.c:1223 #13 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8561178) at tpool.c:663 #14 0xb7e4d49b in start_thread () from /lib/libpthread.so.0 #15 0xb7bdc42e in clone () from /lib/libc.so.6 (gdb) p *(EntryInfo *)v_e1 $2 = {bei_parent = 0x8622058, bei_id = 343637, bei_lockpad = 0, bei_state = 132, bei_finders = 0, bei_nrdn = {bv_len = 45, bv_val = 0xf7af1f8 "serialNumber=40442595174b5ebf53167252b5cbcc34"}, bei_rdn = {bv_len = 45, bv_val = 0xf36d700 "serialNumber=40442595174B5EBF53167252B5CBCC34"}, bei_modrdns = 0, bei_ckids = 0, bei_dkids = 0, bei_e = 0x0, bei_kids = 0x0, bei_kids_mutex = {__data = {__lock = 1, __count = 0, __owner = 32550, __kind = 0, __nusers = 1, {__spins = 0, __list = {__next = 0x0}}}, __size = "\001\000\000\000\000\000\000\000&\177\000\000\000\000\000\000\001\000\000\000\000\000\000", __align = 1}, bei_lrunext = 0x8978ad0, bei_lruprev = 0x0} (gdb) p *(EntryInfo *)v_e2 $3 = {bei_parent = 0x0, bei_id = 410257, bei_lockpad = 0, bei_state = 0, bei_finders = 0, bei_nrdn = {bv_len = 45, bv_val = 0x0}, bei_rdn = { bv_len = 45, bv_val = 0x0}, bei_modrdns = 0, bei_ckids = 0, bei_dkids = 0, bei_e = 0x0, bei_kids = 0x0, bei_kids_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0, __list = {__next = 0x0}}}, __size = '\0' <repeats 23 times>, __align = 0}, bei_lrunext = 0x8978ad0, bei_lruprev = 0x0}
In bdb_rdn_cmp() you're comparing e1->bei_nrdn.bv_val to e2->bei_nrdn.bv_val but the latter is null even though e2->bei_nrdn.bv_len is not zero.
standard syncrepl crash ----------------------- (gdb) where #0 0xb7d9f300 in pthread_mutex_lock () from /lib/libpthread.so.0 #1 0x081d8645 in ldap_pvt_thread_mutex_lock (mutex=0x34) at thr_posix.c:296 #2 0x08145bc0 in bdb_cache_delete_internal (cache=0xa25cc04, e=0x147e6d38, decr=0) at cache.c:1372 #3 0x08144b08 in hdb_cache_lru_purge (bdb=0xa25cbc8) at cache.c:776 #4 0x081452d1 in hdb_cache_find_id (op=0x11923ca8, tid=0x1316aa48, id=280869, eip=0x2bffcbfc, flag=0, lock=0x2bffcc54) at cache.c:1053 #5 0x0814b2f7 in hdb_dn2entry (op=0x11923ca8, tid=0x1316aa48, dn=0xa25e3bc, e=0x2bffcc64, matched=0, lock=0x2bffcc54) at dn2entry.c:67 #6 0x08151845 in hdb_entry_get (op=0x11923ca8, ndn=0xa25e3bc, oc=0xa2549f0, at=0xa23ebb0, rw=0, ent=0x2bffceac) at id2entry.c:354 #7 0x080e936d in overlay_entry_get_ov (op=0x11923ca8, dn=0xa25e3bc, oc=0xa2549f0, ad=0xa23ebb0, rw=0, e=0x2bffceac, on=0x0) at backover.c:364 #8 0x080e941f in over_entry_get_rw (op=0x11923ca8, dn=0xa25e3bc, oc=0xa2549f0, ad=0xa23ebb0, rw=0, e=0x2bffceac) at backover.c:396 #9 0x0807cfc3 in be_entry_get_rw (op=0x11923ca8, ndn=0xa25e3bc, oc=0xa2549f0, at=0xa23ebb0, rw=0, e=0x2bffceac) at backend.c:1346 #10 0x0807d1a2 in fe_acl_group (op=0x11923ca8, target=0x0, gr_ndn=0xa25e3bc, op_ndn=0x11923d44, group_oc=0xa2549f0, group_at=0xa23ebb0) at backend.c:1402 #11 0x0807d8c3 in backend_group (op=0x11923ca8, target=0x0, gr_ndn=0xa25e3bc, op_ndn=0x11923d44, group_oc=0xa2549f0, group_at=0xa23ebb0) at backend.c:1597 #12 0x080d32e2 in limits_get (op=0x11923ca8, limit=0x11923cdc) at limits.c:117 #13 0x080d588d in limits_check (op=0x11923ca8, rs=0x2bffe0e8) at limits.c:1106 #14 0x0806f095 in fe_op_search (op=0x11923ca8, rs=0x2bffe0e8) at search.c:364 #15 0x0806ea53 in do_search (op=0x11923ca8, rs=0x2bffe0e8) at search.c:217 #16 0x0806b959 in connection_operation (ctx=0x2bffe1d0, arg_v=0x11923ca8) at connection.c:1097 #17 0x0806be33 in connection_read_thread (ctx=0x2bffe1d0, argv=0x1d) at connection.c:1223 #18 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0xa211178) at tpool.c:663 #19 0xb7d9d49b in start_thread () from /lib/libpthread.so.0 #20 0xb7b2c42e in clone () from /lib/libc.so.6 (gdb) p *e->bei_parent Cannot access memory at address 0x0
In bdb_cache_delete_internal(), e->bei_parent is null.
2-way multimaster syncrepl crash -------------------------------- (gdb) where #0 0x08149c1a in hdb_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:217 #1 0x080e9b02 in overlay_op_walk (op=0xc01caa8, rs=0x2ecfc0e8, which=op_delete, oi=0x88b4550, on=0x0) at backover.c:669 #2 0x080e9cb7 in over_op_func (op=0xc01caa8, rs=0x2ecfc0e8, which=op_delete) at backover.c:721 #3 0x080e9de5 in over_op_delete (op=0xc01caa8, rs=0x2ecfc0e8) at backover.c:773 #4 0x0808ae88 in fe_op_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:174 #5 0x0808abdb in do_delete (op=0xc01caa8, rs=0x2ecfc0e8) at delete.c:95 #6 0x0806b959 in connection_operation (ctx=0x2ecfc1d0, arg_v=0xc01caa8) at connection.c:1097 #7 0x0806be33 in connection_read_thread (ctx=0x2ecfc1d0, argv=0x22) at connection.c:1223 #8 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8867178) at tpool.c:663 #9 0xb7e4749b in start_thread () from /lib/libpthread.so.0 #10 0xb7bd642e in clone () from /lib/libc.so.6 (gdb) p eip $1 = (EntryInfo *) 0x0
In hdb_delete(), eip is null.
2-way multimaster syncrepl crash -------------------------------- (gdb) where #0 0x080956db in slap_acl_get (a=0x8b9ba80, count=0x37f5d824, op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, matches=0x37f5d1c4, state=0x37f5d97c) at acl.c:599 #1 0x080949b9 in slap_access_allowed (op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x37f5d8e8) at acl.c:236 #2 0x08094cd5 in fe_access_allowed (op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x37f5d8e8) at acl.c:360 #3 0x0809503c in access_allowed_mask (op=0x8c5d948, e=0x37f5dad0, desc=0x8b9b9f0, val=0x0, access=ACL_WDEL, state=0x37f5d97c, maskp=0x0) at acl.c:462 #4 0x08098b8f in acl_check_modlist (op=0x8c5d948, e=0x37f5dad0, mlist=0x34afa7c8) at acl.c:1986 #5 0x0810373a in hdb_modify_internal (op=0x8c5d948, tid=0x961aa10, modlist=0x34afa7c8, e=0x37f5dad0, text=0x37f5f0fc, textbuf=0x37f5db18 "", textlen=256) at modify.c:88 #6 0x08105292 in hdb_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:590 #7 0x080e9b02 in overlay_op_walk (op=0x8c5d948, rs=0x37f5f0e8, which=op_modify, oi=0x8bbb550, on=0x0) at backover.c:669 #8 0x080e9cb7 in over_op_func (op=0x8c5d948, rs=0x37f5f0e8, which=op_modify) at backover.c:721 #9 0x080e9d7f in over_op_modify (op=0x8c5d948, rs=0x37f5f0e8) at backover.c:755 #10 0x08088a89 in fe_op_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:301 #11 0x080884c8 in do_modify (op=0x8c5d948, rs=0x37f5f0e8) at modify.c:175 #12 0x0806b959 in connection_operation (ctx=0x37f5f1d0, arg_v=0x8c5d948) at connection.c:1097 #13 0x0806be33 in connection_read_thread (ctx=0x37f5f1d0, argv=0x22) at connection.c:1223 #14 0x081d74a1 in ldap_int_thread_pool_wrapper (xpool=0x8b6e178) at tpool.c:663 #15 0xb7e6749b in start_thread () from /lib/libpthread.so.0 #16 0xb7bf642e in clone () from /lib/libc.so.6 (gdb) p e->e_nname $3 = {bv_len = 77, bv_val = 0x0}
In slap_acl_get(), e->e_nname.bv_val (e->e_ndn) is null.
delta syncrepl slapd.conf ------------------------- serverID 1 ldap://rufus01.u.washington.edu:290 serverID 2 ldap://rufus04.u.washington.edu:290
gentlehup on idletimeout 600 sizelimit 20000 sortvals member
# ===== Replication db ==== database hdb suffix "cn=log" directory /data/au20aa_log rootdn "cn=Manager,dc=washington,dc=edu" #index objectClass eq #index entryUUID,entryCSN eq index default eq index entryCSN,objectClass,reqEnd,reqResult,reqStart
overlay syncprov syncprov-reloadhint true syncprov-nopresent true
# ===== Primary db === database hdb suffix "dc=washington,dc=edu" rootdn "cn=Manager,dc=washington,dc=edu" include /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf directory /data/au20aa index entryCSN eq index entryUUID eq
checkpoint 10000 15 cachesize 90000 idlcachesize 180000
overlay syncprov #syncprov-sessionlog 100 syncprov-checkpoint 1000 60
# accesslog overlay definitions for primary database overlay accesslog logdb cn=log logops writes logsuccess true
# scan the accesslog DB every day, and purge entries older than 7 days logpurge 07+00:00 01+00:00
standard syncrepl slapd.conf ---------------------------- serverID 1 ldap://rufus01.u.washington.edu:290 serverID 2 ldap://rufus04.u.washington.edu:290
gentlehup on idletimeout 600 sizelimit 20000 sortvals member
# ===== Primary db ===== database hdb suffix "dc=washington,dc=edu" rootdn "cn=Manager,dc=washington,dc=edu" include /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf directory /data/au20aa index entryCSN eq index entryUUID eq
checkpoint 10000 15 cachesize 90000 idlcachesize 180000
overlay syncprov #syncprov-sessionlog 100 syncprov-checkpoint 1000 60
2-way multimaster syncrepl slapd.conf ------------------------------------- serverID 1 ldap://rufus01.u.washington.edu:290 serverID 2 ldap://rufus04.u.washington.edu:290
gentlehup on idletimeout 600 sizelimit 20000 sortvals member
# ===== Primary db ===== database hdb suffix "dc=washington,dc=edu" rootdn "cn=Manager,dc=washington,dc=edu" include /usr/local/openldap-2.4.16/etc/openldap/private/slapd-pw.conf directory /data/au20aa index entryCSN eq index entryUUID eq
checkpoint 10000 15 cachesize 90000 idlcachesize 180000
overlay syncprov #syncprov-sessionlog 100 syncprov-checkpoint 1000 60
### ### ----------- Replication ###
syncrepl rid=101 provider=ldap://rufus01.u.washington.edu:290 searchbase="dc=washington,dc=edu" bindmethod=sasl saslmech=external starttls=yes tls_cert=/usr/local/openldap-2.4.16/etc/openldap/rufus04.crt tls_key=/usr/local/openldap-2.4.16/etc/openldap/private/rufus04.key filter="(objectClass=*)" attrs="*,+" schemachecking=off scope=sub type=refreshAndPersist retry="60 +"
syncrepl rid=102 provider=ldap://rufus04.u.washington.edu:290 searchbase="dc=washington,dc=edu" bindmethod=sasl saslmech=external starttls=yes tls_cert=/usr/local/openldap-2.4.16/etc/openldap/rufus04.crt tls_key=/usr/local/openldap-2.4.16/etc/openldap/private/rufus04.key filter="(objectClass=*)" attrs="*,+" schemachecking=off scope=sub type=refreshAndPersist retry="60 +"
mirrormode yes