masarati@aero.polimi.it wrote:
Both, as well as when running the HEAD test suite with the 2.4.23 release. Looks as if the swamp additions have tripped over an existing problem, not anything new. Leave it out of RE24 until it has been resolved?
Btw, any other Solaris test runs out there? I'd like to know whether it is a real Solaris problem or just me.
I'm seeing a similar failure on 32-bit Sparc Solaris 10. But it actually locks up in test036 for me; I never get as far as test039. The gdb trace looks much the same as what you posted.
Looks like for some reason threads that are blocked waiting for their sockets to become writable are never getting woken up. A regular SIGINT shuts down slapd cleanly, so it doesn't appear to be a problem with the condvars being used to manage the threads. That kinda points to select() simply not returning the writable status.
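For anyone who wants to check the select() behaviour in isolation, here is a minimal sketch of what is expected: a socket whose send buffer is full should not be reported writable, and it should be reported writable again once the peer has read the data. This is Python on a local socketpair, not slapd code, and the helper names (wait_writable, fill_send_buffer, drain) are made up for illustration.

```python
import select
import socket


def wait_writable(sock, timeout):
    """Return True if select() reports sock as writable within timeout seconds."""
    _, writable, _ = select.select([], [sock], [], timeout)
    return sock in writable


def fill_send_buffer(sock):
    """Send without blocking until the kernel refuses more; return bytes queued."""
    sock.setblocking(False)
    chunk = b"x" * 65536
    total = 0
    try:
        while True:
            total += sock.send(chunk)
    except BlockingIOError:
        # The send buffer is full; a writer would now have to wait in select().
        pass
    return total


def drain(sock, nbytes):
    """Read up to nbytes from sock so the peer's send buffer can empty."""
    remaining = nbytes
    while remaining > 0:
        data = sock.recv(min(65536, remaining))
        if not data:
            break
        remaining -= len(data)
    return nbytes - remaining
```

Usage would be along the lines of: create a pair with socket.socketpair(), call fill_send_buffer() on one end, confirm wait_writable() times out, then drain() the other end and confirm wait_writable() succeeds. If the last step never succeeds on a given platform, that would match the symptom described above.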
I haven't used this Solaris machine much, but in fact (looking at the remnants of other files in my source tree on this box) this appears to have been a problem since at least last August. (I.e., it looks like I was investigating this same problem back then but dropped it and never got back to it.)
Not sure whether it is related, but I'm currently running test036 with -DLDAP_THREAD_DEBUG (for unrelated purposes) and I see some mutex-related failures, of the type
conn=1031 op=1 SRCH base="cn=Monitor" scope=2 deref=0 filter="(objectClass=*)"
../../../ldap-2.4-src/libraries/libldap_r/thr_debug.c:1029: ldap_pvt_thread_mutex_unlock error: !THREAD_MUTEX_OWNER( mutex )
../../../ldap-2.4-src/libraries/libldap_r/thr_debug.c:1033: ldap_pvt_thread_mutex_unlock error: rc is 1
I see a lot of them; they always appear within operations affecting back-monitor, which seems to be consistent with Rein's backtrace.
uname -a
Linux fl1 2.6.34.7-0.5-desktop #1 SMP PREEMPT 2010-10-25 08:40:12 +0200 x86_64 x86_64 x86_64 GNU/Linux
Running with valgrind/helgrind, I get a hang on Linux too. Unfortunately I can't get a backtrace from the valgrind'd slapd. It shows a fair number of data races in back-meta.
There are also some lock ordering issues, but we already know about most of them and the code avoids deadlock using trylock() when needed. But there are a couple of places that don't use trylock(), and thus are deadlock hazards. (Request and abandon in libldap seem to be the prime offenders.)
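To illustrate the trylock() pattern referred to above, here is a minimal sketch of taking two locks in an arbitrary order without deadlocking: block on the first lock, try the second, and back off if the try fails. It is written in Python with threading.Lock standing in for the pthread mutexes; acquire_pair, release_pair and worker are hypothetical names, not anything in libldap.

```python
import threading
import time


def acquire_pair(first, second):
    """Acquire both locks; never block on second while holding first."""
    while True:
        first.acquire()
        if second.acquire(blocking=False):  # the trylock step
            return
        # Someone else holds second: drop first so they can make progress.
        first.release()
        time.sleep(0)  # yield before retrying


def release_pair(first, second):
    """Release in the reverse of the acquisition order."""
    second.release()
    first.release()


def worker(first, second, counter, rounds):
    """Repeatedly take both locks and bump a shared counter under them."""
    for _ in range(rounds):
        acquire_pair(first, second)
        counter[0] += 1
        release_pair(first, second)
```

Two threads can then call worker(a, b, ...) and worker(b, a, ...) with opposite lock orders and still both finish. The code paths flagged as hazards are the ones that block on the second lock instead of trying it.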
I've uploaded my testrun directory to http://highlandsun.com/hyc/20110111-testr.tgz
for reference. (Looks like ftp.openldap.org is full again.)