Re: (ITS#6310) Slapd with pcache crashes under load - openldap-bugs

5 Oct 2009


luben karavelov wrote:
...
masarati@aero.polimi.it wrote:
...
Thanks for collecting this info.  The valgrind output could be of some
use, but unfortunately I don't have time right now to set up a working
RDBMS and debug things extensively.  I'll keep this on my todo list.
Please re-run valgrind with --num-callers=30 or more, because in some
cases the errors occur in functions nested too deeply to tell whether
the issue is caused by garbage fed by slapd/back-sql or by errors
inside the RDBMS/ODBC layers.  The fact that valgrind systematically
complains about internals of the RDBMS/ODBC reading past the end of
memory chunks malloc'ed by slapd could be related to passing some
non-NUL-terminated bervals that are treated as strings.  A longer call
stack could help track down those occurrences.  However, those issues
should not be critical, since there are no invalid writes.
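To illustrate the suspected read-past-the-end pattern, here is a minimal sketch, not slapd code. The type demo_berval and the function demo_berval_to_cstr() are hypothetical stand-ins that only mirror the length-plus-pointer shape of a berval. The sketch shows one way to hand such a value to string-based code without reading beyond the original allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a berval: a length plus a pointer to bytes
 * that are NOT guaranteed to be NUL-terminated. */
struct demo_berval {
    size_t bv_len;
    char  *bv_val;
};

/* Return a freshly malloc'ed, NUL-terminated copy of bv, or NULL if the
 * allocation fails.  Exactly bv_len bytes are read from bv_val, so the
 * copy never touches memory past the end of the original chunk. */
char *demo_berval_to_cstr(const struct demo_berval *bv)
{
    char *s;

    assert(bv != NULL);
    s = malloc(bv->bv_len + 1);
    if (s == NULL)
        return NULL;
    memcpy(s, bv->bv_val, bv->bv_len);
    s[bv->bv_len] = '\0';   /* terminator lives in the copy only */
    return s;
}
```

Calling strlen() directly on bv_val, by contrast, keeps reading until it happens to find a zero byte, which is the kind of invalid read valgrind would report inside whatever library does the reading.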
Also, you should walk through the list of attributes being returned, to
provide a hint about whether back-sql is computing a malformed attribute
list.  Along the lines of your current gdb session, you should get to frame
#5, refresh_merge() in pcache.c, and print *e->e_attrs,
*e->e_attrs->a_desc, *e->e_attrs->a_vals[0]; then move to
e->e_attrs->a_next and repeat the prints to the end of the list.  The
fact that you get a value of "a" equal to 0x500000000 looks definitely
odd to me, as that attr list should result from be_entry_get_rw(), which
in turn should collect it from the local database.  Unless valgrind
reveals some oddity in back-sql, the behavior you notice should not
depend on the specific remote database you're using, but rather on the
local one.
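The list walk described above can be sketched in C as follows. This is only an illustration: demo_attr, demo_entry and demo_dump_attrs() are hypothetical, heavily simplified stand-ins for the structures reached through e->e_attrs, with one string replacing a_desc and another replacing a_vals[0].

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for an attribute node and an entry. */
struct demo_attr {
    const char       *a_name;   /* stands in for a_desc */
    const char       *a_val0;   /* stands in for a_vals[0] */
    struct demo_attr *a_next;   /* next node, NULL at the end of the list */
};

struct demo_entry {
    struct demo_attr *e_attrs;  /* head of the attribute list */
};

/* Print every node reachable from e->e_attrs and return how many were
 * visited.  The walk stops after max_nodes so that a list corrupted into
 * a cycle cannot loop forever. */
size_t demo_dump_attrs(const struct demo_entry *e, size_t max_nodes)
{
    const struct demo_attr *a;
    size_t n = 0;

    assert(e != NULL);
    for (a = e->e_attrs; a != NULL && n < max_nodes; a = a->a_next) {
        printf("attr[%zu] at %p: %s = %s\n", n, (const void *)a,
               a->a_name ? a->a_name : "(null)",
               a->a_val0 ? a->a_val0 : "(null)");
        n++;
    }
    return n;
}
```

Printing each node's address is the useful part here: a value such as 0x500000000 would stand out before it is dereferenced. The sketch cannot validate a wild pointer itself, so on a corrupted list it would crash at the bad node, just as the gdb prints would fail there.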
p.
Hello,
Tomorrow I will make a setup with a pure sql process and a pure pcache
daemon that reads from the first over a unix domain socket. This way it
will be clear whether the crashing part is related to back-sql and the
database drivers/ODBC manager or not.
Meanwhile, you can find the requested debugging session here:
http://purgatory.spnet.net/~karavelov/attr_list/gdb-1
It seems that the "e" pointer is corrupted.
Good catch.
...
Tomorrow I will run it under valgrind with more stack frames, as requested.
Another check you could probably do fairly quickly is to zero out
that "e" pointer before calling be_entry_get_rw() within refresh_merge().
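The reasoning behind that check can be sketched as follows. Everything here is hypothetical: sketch_entry, sketch_entry_get() and sketch_lookup() are made up for illustration, and the getter is assumed to leave its output pointer untouched on failure, which may or may not match what be_entry_get_rw() does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry type, unrelated to slapd's real one. */
struct sketch_entry {
    int id;
};

/* Hypothetical getter.  On success it stores an entry in *out and
 * returns 0.  On failure it returns nonzero and leaves *out untouched
 * (an assumption made for this sketch). */
int sketch_entry_get(int want_success, struct sketch_entry **out)
{
    static struct sketch_entry found = { 42 };

    assert(out != NULL);
    if (!want_success)
        return 1;
    *out = &found;
    return 0;
}

/* Return 1 if a usable entry was obtained, 0 otherwise. */
int sketch_lookup(int want_success)
{
    struct sketch_entry *e = NULL;  /* zeroed before the call */
    int rc = sketch_entry_get(want_success, &e);

    /* Because e started out as NULL, a failed or partial lookup leaves
     * NULL behind instead of whatever garbage was on the stack. */
    return rc == 0 && e != NULL;
}
```

With an uninitialized pointer, a failure path that never assigns it would leave stale stack contents that look like a valid address. Starting from NULL turns that case into something a simple test, or a debugger, shows immediately.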
Thanks, p.