Re: (ITS#5439) back-bdb race condition - openldap-bugs

25 Mar 2008


      Rein Tollevik wrote:
...
On Tue, 25 Mar 2008, hyc@symas.com wrote:
It now looks to me as if the entire rs->sr_entry was released an reused,
and that the bug probably is in back-bdb.  It just always happened when
syncprov was running as this is a master server with mostly writes
except from the reads syncprov does.  Which also means that the title on
the bug report is probably misleading :-(
The reason for my shift of focus is some highly suspicious variables
found in the bdb_search() frame.
Look at the value of idflag in bdb_search. (see the comment around 672)
Could be a locking bug with ID_NOCACHE in cache.c...
...
First, ei->bei_e seem to be NULL every time this happen, not the same as
the "e" variable as I would except from looking at the code.  Second,
e->e_id is either 0 or ei->bei_id+1, the content of *e is the (more or
less completely initialized) entry following the *ei entry in the
database, while I would expect that ei->bei_id==e->e_id and that
ei->bei_e==e.  I don't think the consecutive ID numbers or entries has
anything with the bug to do though, they are probably a result of the
searches reading the entire database.
I looks to me as if the EntryInfo referred to by ei was released somehow
when slap_send_search_entry() was running, and that the ei->bei_e passed
as rs->rs_entry was first released and then reused for the next entry.
Alternatively that the structures wasn't properly initialized and
thereby appeared free to the second thread.
I have found a core file that adds to my suspicions, this time with two
active threads.  Their stack frames are shown at the end.  In this core
file the id variable in the bdb_search() frame is 9318, the second
thread is loading id==9319.  The value of the "e" variable in
bdb_search() for the first thread equals the "x" variable of the
entry_decode() frame in the second thread.  Since entry_decode() has
just allocated its "x" value it shouldn't have found the same value as
the other thread has unless someone released its "e" entry while it was
still using it.  Why this happened I haven't figured out yet.
The cachsesize directive has its default value, which is way below the
optimal value.  The result is that the cache is always full.  I don't
know if that can have any influence on the bug?  I tried to lower its
value to see if that should help reproducing the problem, but not so
far.
...
...
I'm currently trying to gather more information related to this bug, any
pointers as to what I should look for is appreciated.  I'm posting this bug
report now in the hope that the stack frame should enlighten someone with better
knowledge of the code than what I have.
Check for stack overruns, compile without optimization and make sure it's not
a compiler optimization bug, etc.
I'll try to run it with a version compiled with only -g and see if the
same thing should happen then.
-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/