Richard Silverman wrote:
On Thu, 20 Dec 2012, Howard Chu wrote:
It would probably be helpful for you to post your test code as well so other developers can use it to reproduce the issue.
As I said in the paragraph you quoted just above, I’m using the test code provided in the referenced bug report. It required only minimal obvious tweaking to get it to work. I can provide a tarball of my own setup if that would be helpful.
Hm, I missed the link to attachments in the original ITS. I see it now.
The problem is that they are *all* here; there is no one left to signal that condition variable, so slapd locks up forever.
The previously quoted code clearly sets conn->c_writing to 0 if a hard error occurs. What are the values in *conn when this hang happens?
Also, when a connection error is detected, if you had CONNS debug enabled, there would be an error message to this effect. Have you enabled connection debug output, and do you see these messages in the slapd debug output right before the hang?
Yes, it does do that; nevertheless, conn->c_writing == 1 in all threads when the deadlock occurs.
Which thread owns the write1 mutex at that point?
I don’t know how to determine that. Here’s a dump of *conn in one of the threads:
$21 = {c_struct_state = SLAP_C_USED, c_conn_state = SLAP_C_ACTIVE, c_conn_idx = 39, c_sd = 39, c_close_reason = 0x5add3b "?", c_mutex = {wrapped = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0}, usage = { magic = ldap_debug_magic, self = 18446603377522038887, mem = {ptr = 0x0, num = 0}, state = ldap_debug_state_inited}, owner = 18446744073709551615}, c_sb = 0x7ff5f0114c30, c_starttime = 1356066580, c_activitytime = 1356066580, c_connid = 1370, c_peer_domain = {bv_len = 7, bv_val = 0x7ff630001140 "unknown"}, c_peer_name = {bv_len = 18, bv_val = 0x7ff630001120 "IP=127.0.0.1:42441"}, c_listener = 0x1325390, c_sasl_bind_mech = {bv_len = 0, bv_val = 0x0}, c_sasl_dn = {bv_len = 0, bv_val = 0x0}, c_sasl_authz_dn = {bv_len = 0, bv_val = 0x0}, c_authz_backend = 0x13767e0, c_authz_cookie = 0x0, c_authz = { sai_method = 128, sai_mech = {bv_len = 0, bv_val = 0x0}, sai_dn = {bv_len = 24, bv_val = 0x7ff624111980 "cn=manager,dc=alu,dc=com"}, sai_ndn = {bv_len = 24, bv_val = 0x7ff6241086c0 "cn=manager,dc=alu,dc=com"}, sai_ssf = 0, sai_transport_ssf = 0, sai_tls_ssf = 0, sai_sasl_ssf = 0}, c_protocol = 3, c_ops = {stqh_first = 0x7ff62813db60, stqh_last = 0x7ff6241108c8}, c_pending_ops = {stqh_first = 0x7ff6240020c0, stqh_last = 0x7ff624139028}, c_write1_mutex = {wrapped = { __data = {__lock = 0, __count = 0, __owner = 0, __nusers = 7, __kind = 2, __spins = 0, __list = { __prev = 0x0, __next = 0x0}},
c_write1_mutex owner=0, which means nobody owns it currently.
As for the rest - "reverse engineering" would be something like taking a compiled binary and trying to decompile it. Reading source code is simply that, it's not reverse engineering.
This is a matter of definition. To me, reverse engineering is when you take insufficiently commented code and try to divine the organizing principles behind it; it doesn’t matter if the code you’re looking at it is machine code, assembly, or C (and I’ve done plenty of all of those).
The variables are documented with their purpose and usage, the comments show what a block of code aims to do. What more do you expect?
Perhaps I have missed some documentation. What would help here is some high-level description of the various synchronization primitives involved, their consumers, what critical regions they aim to protect and how their usage is supposed to avoid deadlock, along with detailed comments to that effect. Here are some examples. This code in send_ldap_ber():
conn->c_writers++; while ( conn->c_writers > 0 && conn->c_writing ) { ldap_pvt_thread_cond_wait( &conn->c_write1_cv, &conn->c_write1_mutex ); } /* connection was closed under us */ if ( conn->c_writers < 0 ) { /* we're the last waiter, let the closer continue */ if ( conn->c_writers == -1 ) ldap_pvt_thread_cond_signal( &conn->c_write1_cv ); conn->c_writers++; ldap_pvt_thread_mutex_unlock( &conn->c_write1_mutex ); return 0; }
There are almost no comments here. conn->c_writers is mysteriously incremented, perhaps to indicate that we are now intending to write?
slap.h c_writers /* number of writers waiting */
But then this thread blocks waiting for it to become <= 0. Why would it do that after we incremented it, when other threads will presumably only decrement if they had previously incremented it?
Use grep.
We are signalling c_write1_cv; who is waiting on it?
Use grep.
Why will waking them up help? The next section refers to magic numbers 0 and -1 with important implications such as our being the “last waiter,” without explaining why that would be so,
/* connection was closed under us */ is the explanation.
again silently increments conn->c_writers (why when we already did that?),
The c_writers is now negative, obviously. The increment is moving it toward zero.
signals a condition variable and unlocks a mutex without mentioning the purpose of doing so, and returns 0 without saying what the effect of that should be.
I can make lots of guesses as to what might be going on here, but that’s all: guesses, along with lots of experimentation to figure out what’s missing, and that is reverse engineering.
The entire code base is at your fingertips. If you're making guesses then you just haven't read enough.