rein@basefarm.no wrote:
Full_Name: Rein Tollevik
Version: CVS head
OS: CentOS 4.4
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (81.93.160.250)
We have been hit by what looks like a race condition bug in syncprov. We got some core dumps, all showing stack frames like the one at the end. As such nasty bugs tend to do, it has behaved OK since I restarted slapd with more debug output :-( (trace + stats + stats2 + sync).
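For anyone reproducing the setup, the debug levels named above can be combined into one numeric mask for slapd's -d option or the loglevel directive. This is a minimal sketch, assuming the standard level values documented in slapd.conf(5) (trace = 1, stats = 256, stats2 = 512, sync = 16384). The enum names and the debug_mask() helper are made up for illustration and are not part of slapd.

```c
#include <stdio.h>

/* Level values as documented for slapd's loglevel directive.
 * The DBG_* names are illustrative, not slapd's own identifiers. */
enum {
    DBG_TRACE  = 0x0001,  /* trace function calls */
    DBG_STATS  = 0x0100,  /* connections, operations, results */
    DBG_STATS2 = 0x0200,  /* entries sent */
    DBG_SYNC   = 0x4000   /* syncrepl / syncprov processing */
};

/* Hypothetical helper: OR the four levels into a single mask. */
int debug_mask(void)
{
    return DBG_TRACE | DBG_STATS | DBG_STATS2 | DBG_SYNC;
}

/* Hypothetical helper: print the mask in a form usable as "slapd -d <mask>". */
void print_debug_mask(void)
{
    printf("%d\n", debug_mask());
}
```

Under those assumptions the combined value is 17153 (0x4301).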
The configuration is a master server with multiple bdb backend databases, all subordinate to the same glue database where syncprov is used. One of the backends is a syncrepl consumer from another server; the server is master for the other backends. There are multiple consumers for the syncprov suffix, which I assume is what causes the race condition to happen.
Note the a=0xBAD argument to attr_find(), which I expect is the result of some other thread freeing the attribute list it was called with while it was processing it. The rs->sr_entry->e_attrs argument passed to attr_find() as the original "a" argument by findpres_cb() looks like a perfectly valid structure, as do all the attributes found by following the a_next pointer. The list is terminated by an attribute with a NULL a_next value, and none of the a_next values are 0xBAD.
I don't believe that's the cause. Notice that arg0 in stack frame #9 is also 0xbad, even though it is shown correctly in frames 8 and 10. Something else is going on.
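The list check described in the report (follow a_next from e_attrs and look for 0xBAD) could be sketched roughly as below. This is only an illustration: the Attribute struct here is a simplified, hypothetical stand-in rather than slapd's real definition, and attr_list_looks_sane() and its max_nodes guard are invented names. In practice the same walk would be done by hand in gdb against the core file.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for slapd's Attribute.
 * Only the a_next link matters for this check. */
typedef struct Attribute {
    const char *a_name;          /* illustrative field only */
    struct Attribute *a_next;    /* next attribute, NULL at end of list */
} Attribute;

/* The suspicious pointer value seen in the stack trace. */
#define SUSPECT_PTR ((uintptr_t)0xBAD)

/* Returns 1 if the list reaches a NULL a_next within max_nodes nodes
 * without ever encountering the 0xBAD pointer, 0 otherwise.
 * The pointer is tested before it is dereferenced. */
int attr_list_looks_sane(const Attribute *head, size_t max_nodes)
{
    const Attribute *a = head;
    size_t n = 0;

    while (a != NULL) {
        if ((uintptr_t)a == SUSPECT_PTR)
            return 0;            /* found the bad pointer in the chain */
        if (++n > max_nodes)
            return 0;            /* too long: likely a cycle or corruption */
        a = a->a_next;
    }
    return 1;
}
```

A list that passes this check, as the reported one did, suggests the 0xBAD value did not come from the list itself, which is consistent with the same value showing up as op in frame #9.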
I'm currently trying to gather more information related to this bug; any pointers as to what I should look for are appreciated. I'm posting this bug report now in the hope that the stack frame will enlighten someone with better knowledge of the code than I have.
Check for stack overruns, compile without optimization and make sure it's not a compiler optimization bug, etc.
Rein Tollevik Basefarm AS
#0 0x0807d03a in attr_find (a=0xbad, desc=0x81e8680) at attr.c:665
#1 0xb7a656f6 in findpres_cb (op=0xaf068ba4, rs=0xaf068b68) at syncprov.c:546
#2 0x0808416d in slap_response_play (op=0xaf068ba4, rs=0xaf068b68) at result.c:307
#3 0x0808555b in slap_send_search_entry (op=0xaf068ba4, rs=0xaf068b68) at result.c:770
#4 0x080f2cdc in bdb_search (op=0xaf068ba4, rs=0xaf068b68) at search.c:870
#5 0x080db72b in overlay_op_walk (op=0xaf068ba4, rs=0xaf068b68, which=op_search, oi=0x8274218, on=0x8274318) at backover.c:653
#6 0x080dbcaf in over_op_func (op=0xaf068ba4, rs=0xaf068b68, which=op_search) at backover.c:705
#7 0x080dbdef in over_op_search (op=0xaf068ba4, rs=0xaf068b68) at backover.c:727
#8 0x080d9570 in glue_sub_search (op=0xaf068ba4, rs=0xaf068b68, b0=0xaf068ba4, on=0xaf068ba4) at backglue.c:340
#9 0x080da131 in glue_op_search (op=0xbad, rs=0xaf068b68) at backglue.c:459
#10 0x080db6d5 in overlay_op_walk (op=0xaf068ba4, rs=0xaf068b68, which=op_search, oi=0x8271860, on=0x8271a60) at backover.c:643
#11 0x080dbcaf in over_op_func (op=0xaf068ba4, rs=0xaf068b68, which=op_search) at backover.c:705
#12 0x080dbdef in over_op_search (op=0xaf068ba4, rs=0xaf068b68) at backover.c:727
#13 0xb7a65ff4 in syncprov_findcsn (op=0x85c7e60, mode=FIND_PRESENT) at syncprov.c:700
#14 0xb7a670a0 in syncprov_op_search (op=0x85c7e60, rs=0xaf06a1c0) at syncprov.c:2277
#15 0x080db6d5 in overlay_op_walk (op=0x85c7e60, rs=0xaf06a1c0, which=op_search, oi=0x8271860, on=0x8271b60) at backover.c:643
#16 0x080dbcaf in over_op_func (op=0x85c7e60, rs=0xaf06a1c0, which=op_search) at backover.c:705
#17 0x080dbdef in over_op_search (op=0x85c7e60, rs=0xaf06a1c0) at backover.c:727
#18 0x08076554 in fe_op_search (op=0x85c7e60, rs=0xaf06a1c0) at search.c:368
#19 0x080770e4 in do_search (op=0x85c7e60, rs=0xaf06a1c0) at search.c:217
#20 0x08073e28 in connection_operation (ctx=0xaf06a2b8, arg_v=0x85c7e60) at connection.c:1084
#21 0x08074f14 in connection_read_thread (ctx=0xaf06a2b8, argv=0x59) at connection.c:1211
#22 0xb7fb5546 in ldap_int_thread_pool_wrapper (xpool=0x81ee240) at tpool.c:663
#23 0xb7c80371 in start_thread () from /lib/tls/libpthread.so.0
#24 0xb7c17ffe in clone () from /lib/tls/libc.so.6