Re: slapd deadlock bug ITS#7296 - openldap-devel

20 Dec 2012


      Richard Silverman wrote:
...
On Thu, 20 Dec 2012, Howard Chu wrote:
...
It would probably be helpful for you to post your test code as well so other
developers can use it to reproduce the issue.
As I said in the paragraph you quoted just above, I’m using the test code
provided in the referenced bug report. It required only minimal obvious
tweaking to get it to work. I can provide a tarball of my own setup if
that would be helpful.
Hm, I missed the link to attachments in the original ITS. I see it now.
...
...
...
The problem is that they are *all* here; there is no one left to signal
that condition variable, so slapd locks up forever.
The previously quoted code clearly sets conn->c_writing to 0 if a hard error
occurs. What are the values in *conn when this hang happens?
Also, when a connection error is detected, if you had CONNS debug enabled, 
there would be an error message to this effect. Have you enabled connection 
debug output, and do you see these messages in the slapd debug output right 
before the hang?
...
Yes, it does do that; nevertheless, conn->c_writing == 1 in all threads
when the deadlock occurs.
...
Which thread owns the write1 mutex at that point?
I don’t know how to determine that. Here’s a dump of *conn in one of the
threads:
$21 = {c_struct_state = SLAP_C_USED, c_conn_state = SLAP_C_ACTIVE, c_conn_idx = 39, c_sd = 39,
    c_close_reason = 0x5add3b "?", c_mutex = {wrapped = {__data = {__lock = 0, __count = 0, __owner = 0,
          __nusers = 0, __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
        __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0}, usage = {
        magic = ldap_debug_magic, self = 18446603377522038887, mem = {ptr = 0x0, num = 0},
        state = ldap_debug_state_inited}, owner = 18446744073709551615}, c_sb = 0x7ff5f0114c30,
    c_starttime = 1356066580, c_activitytime = 1356066580, c_connid = 1370, c_peer_domain = {bv_len = 7,
      bv_val = 0x7ff630001140 "unknown"}, c_peer_name = {bv_len = 18, bv_val = 0x7ff630001120 "IP=127.0.0.1:42441"},
    c_listener = 0x1325390, c_sasl_bind_mech = {bv_len = 0, bv_val = 0x0}, c_sasl_dn = {bv_len = 0, bv_val = 0x0},
    c_sasl_authz_dn = {bv_len = 0, bv_val = 0x0}, c_authz_backend = 0x13767e0, c_authz_cookie = 0x0, c_authz = {
      sai_method = 128, sai_mech = {bv_len = 0, bv_val = 0x0}, sai_dn = {bv_len = 24,
        bv_val = 0x7ff624111980 "cn=manager,dc=alu,dc=com"}, sai_ndn = {bv_len = 24,
        bv_val = 0x7ff6241086c0 "cn=manager,dc=alu,dc=com"}, sai_ssf = 0, sai_transport_ssf = 0, sai_tls_ssf = 0,
      sai_sasl_ssf = 0}, c_protocol = 3, c_ops = {stqh_first = 0x7ff62813db60, stqh_last = 0x7ff6241108c8},
    c_pending_ops = {stqh_first = 0x7ff6240020c0, stqh_last = 0x7ff624139028}, c_write1_mutex = {wrapped = {
        __data = {__lock = 0, __count = 0, __owner = 0, __nusers = 7, __kind = 2, __spins = 0, __list = {
            __prev = 0x0, __next = 0x0}},
c_write1_mutex owner=0, which means nobody owns it currently.
...
...
As for the rest - "reverse engineering" would be something like taking a
compiled binary and trying to decompile it. Reading source code is simply
that, it's not reverse engineering.
This is a matter of definition. To me, reverse engineering is when you
take insufficiently commented code and try to divine the organizing
principles behind it; it doesn’t matter if the code you’re looking at it
is machine code, assembly, or C (and I’ve done plenty of all of those).
...
The variables are documented with their purpose and usage, the comments
show what a block of code aims to do. What more do you expect?
Perhaps I have missed some documentation. What would help here is some
high-level description of the various synchronization primitives involved,
their consumers, what critical regions they aim to protect and how their
usage is supposed to avoid deadlock, along with detailed comments to that
effect. Here are some examples. This code in send_ldap_ber():
      conn->c_writers++;

      while ( conn->c_writers > 0 && conn->c_writing ) {
              ldap_pvt_thread_cond_wait( &conn->c_write1_cv, &conn->c_write1_mutex );
      }

      /* connection was closed under us */
      if ( conn->c_writers < 0 ) {
              /* we're the last waiter, let the closer continue */
              if ( conn->c_writers == -1 )
                      ldap_pvt_thread_cond_signal( &conn->c_write1_cv );
              conn->c_writers++;
              ldap_pvt_thread_mutex_unlock( &conn->c_write1_mutex );
              return 0;
      }


There are almost no comments here. conn->c_writers is mysteriously
incremented, perhaps to indicate that we are now intending to write?
slap.h c_writers /* number of writers waiting */
...
But
then this thread blocks waiting for it to become <= 0. Why would it do
that after we incremented it, when other threads will presumably only
decrement if they had previously incremented it?
Use grep.
...
We are signalling
c_write1_cv; who is waiting on it?
Use grep.
...
Why will waking them up help? The next
section refers to magic numbers 0 and -1 with important implications such
as our being the “last waiter,” without explaining why that would be so,
/* connection was closed under us */ is the explanation.
...
again silently increments conn->c_writers (why when we already did that?),
The c_writers is now negative, obviously. The increment is moving it toward zero.
...
signals a condition variable and unlocks a mutex without mentioning the
purpose of doing so, and returns 0 without saying what the effect of that
should be.
I can make lots of guesses as to what might be going on here, but that’s
all: guesses, along with lots of experimentation to figure out what’s
missing, and that is reverse engineering.
The entire code base is at your fingertips. If you're making guesses then you 
just haven't read enough.
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/