Several more fixes completed, please test.
Thanks!
Still waiting on confirmation from someone with gnutls. ;)
--Quanah
--On February 9, 2009 8:11:20 AM -0800 Quanah Gibson-Mount quanah@zimbra.com wrote:
Please test current RE24 CVS. In particular, if people can test:
(a) Building without TLS & without sasl
(b) Building without TLS
(c) Building without sasl
(d) If at least one person can build against GnuTLS instead of OpenSSL
Much appreciated. :)
Thanks,
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
Zimbra :: the leader in open source messaging and collaboration
On Tue, 10 Feb 2009, Quanah Gibson-Mount wrote:
Several more fixes completed, please test.
Fails test050.
<= str2entry NULL (smr_normalize contextCSN 21)
send_ldap_result: conn=-1 op=0 p=0
send_ldap_result: err=80 matched="" text="internal error (cannot parse some entry file)"
slapd destroy: freeing system resources.
slapd stopped.
connections_destroy: nothing to destroy.
testrun dir: https://www.nbcs.rutgers.edu/~richton/test050.fail.RE24-20090210.tar.bz2
Still waiting on confirmation from someone with gnutls. ;)
During the original call for testing, I made one run through without SASL or OpenSSL. I've switched back to a full-blown configuration at this point.
Aaron Richton writes:
<= str2entry NULL (smr_normalize contextCSN 21)
send_ldap_result: conn=-1 op=0 p=0
send_ldap_result: err=80 matched="" text="internal error (cannot parse some entry file)"
The 2nd contextCSN in that entry is garbage:
dn: cn=config
objectClass: olcGlobal
cn: config
structuralObjectClass: olcGlobal
creatorsName: cn=config
createTimestamp: 20090211053745Z
entryUUID: dade8f04-8c49-102d-935b-1f79305f5c76
olcServerID: 1 ldap://localhost:9011/
olcServerID: 2 ldap://localhost:9012/
olcServerID: 3 ldap://localhost:9013/
entryCSN: 20090211053747.167910Z#000000#001#000000
modifiersName: cn=config
modifyTimestamp: 20090211053747Z
contextCSN: 20090211053750.548808Z#000000#001#000000
contextCSN:: fH/zsAAkIZgAJAtgAEP0QABD8nAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
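For reference, the first contextCSN value above shows the expected shape: a 14-digit timestamp, '.', six fractional digits, 'Z', then three '#'-separated fields (roughly: change count, server ID, modification number). Here is a minimal standalone sketch of that syntax check -- just an illustration of the format, not OpenLDAP's own validator:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Rough standalone check, not OpenLDAP code: a CSN such as
 * 20090211053750.548808Z#000000#001#000000 is 40 characters long --
 * 14 timestamp digits, '.', 6 fractional digits, 'Z', then three
 * '#'-separated hex fields of width 6, 3 and 6. */
static int csn_looks_valid( const char *s )
{
    int i;

    if ( strlen( s ) != 40 ) return 0;
    for ( i = 0; i < 14; i++ )      /* timestamp */
        if ( !isdigit( (unsigned char) s[i] ) ) return 0;
    if ( s[14] != '.' ) return 0;
    for ( i = 15; i < 21; i++ )     /* fraction */
        if ( !isdigit( (unsigned char) s[i] ) ) return 0;
    if ( s[21] != 'Z' || s[22] != '#' ) return 0;
    for ( i = 23; i < 29; i++ )     /* change count */
        if ( !isxdigit( (unsigned char) s[i] ) ) return 0;
    if ( s[29] != '#' ) return 0;
    for ( i = 30; i < 33; i++ )     /* server ID */
        if ( !isxdigit( (unsigned char) s[i] ) ) return 0;
    if ( s[33] != '#' ) return 0;
    for ( i = 34; i < 40; i++ )     /* modification number */
        if ( !isxdigit( (unsigned char) s[i] ) ) return 0;
    return 1;
}

int main( void )
{
    /* the first value above passes; the base64 blob clearly cannot */
    printf( "%d\n", csn_looks_valid(
        "20090211053750.548808Z#000000#001#000000" ) );
    return 0;
}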
Hallvard B Furuseth wrote:
Aaron Richton writes:
<= str2entry NULL (smr_normalize contextCSN 21)
send_ldap_result: conn=-1 op=0 p=0
send_ldap_result: err=80 matched="" text="internal error (cannot parse some entry file)"
The 2nd contextCSN in that entry is garbage:
Looking at the log file, it appears to have been corrupted during the checkpoint on the previous shutdown.
I've added some checks/asserts to syncrepl.c and syncprov.c in HEAD. Can you please incorporate those asserts into your build and try to reproduce this? So far I've run test050 50 times and not seen any problem.
dn: cn=config
objectClass: olcGlobal
cn: config
structuralObjectClass: olcGlobal
creatorsName: cn=config
createTimestamp: 20090211053745Z
entryUUID: dade8f04-8c49-102d-935b-1f79305f5c76
olcServerID: 1 ldap://localhost:9011/
olcServerID: 2 ldap://localhost:9012/
olcServerID: 3 ldap://localhost:9013/
entryCSN: 20090211053747.167910Z#000000#001#000000
modifiersName: cn=config
modifyTimestamp: 20090211053747Z
contextCSN: 20090211053750.548808Z#000000#001#000000
contextCSN:: fH/zsAAkIZgAJAtgAEP0QABD8nAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
On Wed, 11 Feb 2009, Howard Chu wrote:
I've added some checks/asserts to syncrepl.c and syncprov.c in HEAD. Can you please incorporate those asserts into your build and try to reproduce this? So far I've run test050 50 times and not seen any problem.
RE24 changed to:
pkg/ldap/servers/slapd/syncrepl.c,v 1.444 2009/02/11 21:00:00
pkg/ldap/servers/slapd/overlays/syncprov.c,v 1.257 2009/02/11 21:00:00
from HEAD. No asserts, and it looks like the same issue (although I could just be reading it wrong pre-coffee). This happened once during an overnight infinite loop (so figure maybe once in ~400-500 runs). testrun:
https://www.nbcs.rutgers.edu/~richton/test050failed.2009022322.tar.bz2
Aaron Richton wrote:
On Thu, 12 Feb 2009, Aaron Richton wrote:
from HEAD. No asserts, and it looks like the same issue (although I could
Uh, oops. I'll recompile with CHECK_CSN now. Right.
I've run about 600 iterations each on two separate servers all day today, no asserts. Tell me again what your system environment is?
On Thu, 12 Feb 2009, Howard Chu wrote:
I've run about 600 iterations each on two separate servers all day today, no asserts. Tell me again what your system environment is?
This is Solaris 9 on sparc; db4 4.7.25 patched.
Running all night didn't fail (assertion or not) a single time. If I only caught it once, I'd blame cosmic rays. Twice kind of bothers me, though.
Aaron Richton wrote:
On Thu, 12 Feb 2009, Howard Chu wrote:
I've run about 600 iterations each on two separate servers all day today, no asserts. Tell me again what your system environment is?
This is Solaris 9 on sparc; db4 4.7.25 patched.
Running all night didn't fail (assertion or not) a single time. If I only caught it once, I'd blame cosmic rays. Twice kind of bothers me, though.
Agreed. But at this point, 2.4.14 is such an improvement over .13 that I'd rather get it out instead of delay it further. We can revisit this for .15 when we get some better traces of the problem.
On Fri, 13 Feb 2009, Howard Chu wrote:
Agreed. But at this point, 2.4.14 is such an improvement over .13 that I'd rather get it out instead of delay it further. We can revisit this for .15 when we get some better traces of the problem.
So I was about to write my "I agree" e-mail, but I caught:
1347 assert( !syn->ssyn_validate( syn, si->si_ctxcsn+i ));
see https://www.nbcs.rutgers.edu/~richton/test050fail.200902142000 for backtrace.
Aaron Richton wrote:
On Fri, 13 Feb 2009, Howard Chu wrote:
Agreed. But at this point, 2.4.14 is such an improvement over .13 that I'd rather get it out instead of delay it further. We can revisit this for .15 when we get some better traces of the problem.
So I was about to write my "I agree" e-mail, but I caught:
1347 assert( !syn->ssyn_validate( syn, si->si_ctxcsn+i ));
see https://www.nbcs.rutgers.edu/~richton/test050fail.200902142000 for backtrace.
Added a few more asserts to HEAD syncprov.c; let's see if we can narrow this down any further. There aren't any other places that set si->si_ctxcsn, so if the closing assert triggers but not the earlier ones, then we probably have an overwrite problem. If the other asserts trigger instead, then we're being fed bad data from something further upstream.
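To make the reasoning concrete, here is a toy sketch of the bracketing idea (the names are invented for the example; this is not the actual syncprov.c/syncrepl.c diff): assert validity right where the value is stored, and again right before it is used. Which assert fires tells you whether the writer received bad data or something later overwrote it.

#include <assert.h>
#include <string.h>

/* Toy illustration only; struct and function names are made up. */
struct toy_state {
    char ctxcsn[64];
};

static int looks_valid( const char *csn )
{
    /* stand-in for a real CSN syntax check */
    return strlen( csn ) == 40 && csn[21] == 'Z';
}

/* The only writer: if this assert fires, the bad value came from upstream. */
static void set_ctxcsn( struct toy_state *st, const char *src )
{
    strncpy( st->ctxcsn, src, sizeof( st->ctxcsn ) - 1 );
    st->ctxcsn[sizeof( st->ctxcsn ) - 1] = '\0';
    assert( looks_valid( st->ctxcsn ) );
}

/* The closing check: if only this assert fires, something between the
 * writer and this point overwrote the stored value. */
static void use_ctxcsn( const struct toy_state *st )
{
    assert( looks_valid( st->ctxcsn ) );
    /* ... consume st->ctxcsn ... */
}

int main( void )
{
    struct toy_state st;
    set_ctxcsn( &st, "20090211053750.548808Z#000000#001#000000" );
    use_ctxcsn( &st );
    return 0;
}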
Aaron Richton wrote:
On Fri, 13 Feb 2009, Howard Chu wrote:
Agreed. But at this point, 2.4.14 is such an improvement over .13 that I'd rather get it out instead of delay it further. We can revisit this for .15 when we get some better traces of the problem.
So I was about to write my "I agree" e-mail, but I caught:
1347 assert( !syn->ssyn_validate( syn, si->si_ctxcsn+i ));
see https://www.nbcs.rutgers.edu/~richton/test050fail.200902142000 for backtrace.
This should now be fixed in HEAD.
On Sun, 15 Feb 2009, Howard Chu wrote:
This should now be fixed in HEAD.
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Aaron Richton wrote:
On Sun, 15 Feb 2009, Howard Chu wrote:
This should now be fixed in HEAD.
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
I've got two servers set up running 1000 iterations each, will see how things go tomorrow.
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
And in slapd.1.log
...
end_ldap_response: msgid=436 tag=101 err=0
ber_flush2: 15 bytes to sd 24
<= hdb_dn2id_add 0x94: -30994
hdb_add: dn2id_add failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
delay = 668, num_retries = 1
<= hdb_dn2id_delete 0x90: 0
=> index_entry_del( 144, "cn=James A Jones 3,ou=Alumni Association,ou=People,dc=example,dc=com" )
=> key_change(DELETE,90)
<= bdb_index_read: failed (-30988)
...
-Dieter
Dieter Kluenter wrote:
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
Which is pretty squarely inside BerkeleyDB. But nobody else has reported any such problem yet. Seems you'll have to check your BDB build.
And in slapd.1.log
...
end_ldap_response: msgid=436 tag=101 err=0
ber_flush2: 15 bytes to sd 24
<= hdb_dn2id_add 0x94: -30994
hdb_add: dn2id_add failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
delay = 668, num_retries = 1
<= hdb_dn2id_delete 0x90: 0
=> index_entry_del( 144, "cn=James A Jones 3,ou=Alumni Association,ou=People,dc=example,dc=com" )
=> key_change(DELETE,90)
<= bdb_index_read: failed (-30988)
...
-Dieter
Dieter Kluenter wrote:
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
It appears you need Oracle's patch #3.
http://www.oracle.com/technology/products/berkeley-db/db/update/4.7.25/patch...
Howard Chu hyc@symas.com writes:
Dieter Kluenter wrote:
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
It appears you need Oracle's patch #3.
http://www.oracle.com/technology/products/berkeley-db/db/update/4.7.25/patch...
This patch had already been applied, as I mentioned in http://www.openldap.org/lists/openldap-devel/200902/msg00031.html; still, one out of five test runs fails.
-Dieter
--On Monday, February 16, 2009 8:19 PM +0100 Dieter Kluenter dieter@dkluenter.de wrote:
This patch had already been applied, as I mentioned in http://www.openldap.org/lists/openldap-devel/200902/msg00031.html; still, one out of five test runs fails.
I'd suggest re-building BDB anyway to be sure it applied, and then testing. You are the only one seeing this, and many people have been testing RE24/2.4.14.
--Quanah
Quanah Gibson-Mount quanah@zimbra.com writes:
--On Monday, February 16, 2009 8:19 PM +0100 Dieter Kluenter dieter@dkluenter.de wrote:
This patch had already been applied, as I mentioned in http://www.openldap.org/lists/openldap-devel/200902/msg00031.html; still, one out of five test runs fails.
I'd suggest re-building BDB anyway to be sure it applied, and then testing. You are the only one seeing this, and many people have been testing RE24/2.4.14.
And how many of those reported successes are based on db-4.7 and amd64, with a large number of iterations of test039? I have done some 200 iterations now, with a success/failure ratio of 5:1.
-Dieter
--On Monday, February 16, 2009 9:57 PM +0100 Dieter Kluenter dieter@dkluenter.de wrote:
And how many of those reported successes are based on db-4.7 and amd64, with a large number of iterations of test039? I have done some 200 iterations now, with a success/failure ratio of 5:1.
Quite a few are using BDB 4.7 these days, although I don't know what CPU. My test box is BDB 4.7.25 with all 3 patches from Oracle (Do not use the one shipped with OL 2.4.12-2.4.13), and has 64-bit AMD CPUs.
What slapd backend are you using to get the 5:1 ratio? back-bdb? back-hdb?
--Quanah
Dieter Kluenter wrote:
Howard Chu hyc@symas.com writes:
Dieter Kluenter wrote:
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
It appears you need Oracle's patch #3.
http://www.oracle.com/technology/products/berkeley-db/db/update/4.7.25/patch...
This patch had already been applied, as I mentioned in http://www.openldap.org/lists/openldap-devel/200902/msg00031.html; still, one out of five test runs fails.
The email you reference above only says you have patch 4.7.25.1. As I said, you need patch #3. So, unless you made a mistake in your previous email, I see no evidence that you have the necessary patch.
--On Monday, February 16, 2009 3:35 PM -0800 Howard Chu hyc@symas.com wrote:
The email you reference above only says you have patch 4.7.25.1. As I said, you need patch #3. So, unless you made a mistake in your previous email, I see no evidence that you have the necessary patch.
I finally got test039 to crash for me, too. Unfortunately, I was on a different system than I usually use, and it didn't have core files enabled. I'm attempting to re-trigger the event.
--Quanah
Howard Chu hyc@symas.com writes:
Dieter Kluenter wrote:
Howard Chu hyc@symas.com writes:
Dieter Kluenter wrote:
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
It appears you need Oracle's patch #3.
http://www.oracle.com/technology/products/berkeley-db/db/update/4.7.25/patch...
This patch had already been applied, as I mentioned in http://www.openldap.org/lists/openldap-devel/200902/msg00031.html; still, one out of five test runs fails.
The email you reference above only says you have patch 4.7.25.1. As I said, you need patch #3. So, unless you made a mistake in your previous email, I see no evidence that you have the necessary patch.
mea culpa, I wasn't aware that there were three patches already.
-Dieter
Aaron Richton wrote:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
Same here, on a server with BDB 4.6.21 and a server with BDB 4.7.25.
With test050 and defined(HAVE_EPOLL), valgrind sometimes reports invalid reads/writes in slapd_daemon_task(). Maybe that clobbers some memory (like the garbage contextCSN mentioned previously).
Turning SLAP_EPOLL_SOCK_SET(s, mode) into a function shows that s == 9 but SLAP_EPOLL_SOCK_IX(s) == -1, so slap_daemon.sd_epolls[] gets a negative index.
I don't know where to go from there. Maybe hangup (below) doesn't clean up after itself.
static void SLAP_EPOLL_SOCK_SET(int s, long mode) {
    long ix = SLAP_EPOLL_SOCK_IX(s);
    assert(ix >= 0);
    long ev = (slap_daemon.sd_epolls[ix]).events;
    if ( (ev & (mode)) != (mode) ) {
        SLAP_EPOLL_SOCK_EV(s) |= (mode);
        epoll_ctl( slap_daemon.sd_epfd, EPOLL_CTL_MOD, (s),
            &SLAP_EPOLL_SOCK_EP(s) );
    }
}
(gdb) thread 3
[Switching to thread 3 (process 3915)]
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
181         assert(ix >= 0);
(gdb) backtrace
#0 0x000000350e230215 in raise () from /lib64/libc.so.6
#1 0x000000350e231cc0 in abort () from /lib64/libc.so.6
#2 0x000000350e229696 in __assert_fail () from /lib64/libc.so.6
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
#4 0x000000000041ef40 in slapd_daemon_task (ptr=0x0) at daemon.c:2557
#5 0x000000350ee06367 in start_thread () from /lib64/libpthread.so.0
(gdb) frame 3
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
181         assert(ix >= 0);

(gdb) print ix
$9 = -1
(gdb) print *slap_daemon.sd_index@128
$10 = {-1, -1, -1, -1, 0, -1, -1, 1, -1 <repeats 120 times>}

(gdb) up
#4 0x000000000041ef40 in slapd_daemon_task (ptr=0x0) at daemon.c:2557
2557            SLAP_EPOLL_SOCK_SET( fd, EPOLLET );
(gdb) list
2552            Debug( LDAP_DEBUG_CONNS,
2553                "daemon: hangup on %d\n", fd, 0, 0 );
2554    #ifdef HAVE_EPOLL
2555            /* Don't keep reporting the hangup
2556             */
2557            SLAP_EPOLL_SOCK_SET( fd, EPOLLET );
2558    #endif
Hallvard B Furuseth wrote:
With test050 and defined(HAVE_EPOLL), valgrind sometimes reports invalid reads/writes in slapd_daemon_task(). Maybe that clobbers some memory (like the garbage contextCSN mentioned previously).
Turning SLAP_EPOLL_SOCK_SET(s, mode) into a function shows that s == 9 but SLAP_EPOLL_SOCK_IX(s) == -1, so slap_daemon.sd_epolls[] gets a negative index.
I don't know where to go from there. Maybe hangup (below) doesn't clean up after itself.
That's interesting. That would require receiving a Hangup event on a descriptor we've already removed from epoll's event list. How can that be? Have you seen this with CONNS debug messages enabled?
static void SLAP_EPOLL_SOCK_SET(int s, long mode) {
    long ix = SLAP_EPOLL_SOCK_IX(s);
    assert(ix >= 0);
    long ev = (slap_daemon.sd_epolls[ix]).events;
    if ( (ev & (mode)) != (mode) ) {
        SLAP_EPOLL_SOCK_EV(s) |= (mode);
        epoll_ctl( slap_daemon.sd_epfd, EPOLL_CTL_MOD, (s),
            &SLAP_EPOLL_SOCK_EP(s) );
    }
}
(gdb) thread 3
[Switching to thread 3 (process 3915)]
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
181         assert(ix >= 0);
(gdb) backtrace
#0 0x000000350e230215 in raise () from /lib64/libc.so.6
#1 0x000000350e231cc0 in abort () from /lib64/libc.so.6
#2 0x000000350e229696 in __assert_fail () from /lib64/libc.so.6
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
#4 0x000000000041ef40 in slapd_daemon_task (ptr=0x0) at daemon.c:2557
#5 0x000000350ee06367 in start_thread () from /lib64/libpthread.so.0
(gdb) frame 3
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
181         assert(ix >= 0);

(gdb) print ix
$9 = -1
(gdb) print *slap_daemon.sd_index@128
$10 = {-1, -1, -1, -1, 0, -1, -1, 1, -1 <repeats 120 times>}

(gdb) up
#4 0x000000000041ef40 in slapd_daemon_task (ptr=0x0) at daemon.c:2557
2557            SLAP_EPOLL_SOCK_SET( fd, EPOLLET );
(gdb) list
2552            Debug( LDAP_DEBUG_CONNS,
2553                "daemon: hangup on %d\n", fd, 0, 0 );
2554    #ifdef HAVE_EPOLL
2555            /* Don't keep reporting the hangup
2556             */
2557            SLAP_EPOLL_SOCK_SET( fd, EPOLLET );
2558    #endif
Howard Chu writes:
That's interesting. That would require receiving a Hangup event on a descriptor we've already removed from epoll's event list. How can that be?
OS bug? Race condition? Valgrind bug, since so far it only happens under Valgrind?
Have you seen this with CONNS debug messages enabled?
Full testrun dir in http://folk.uio.no/hbf/testrun.1.tbz from GRIND=valgrind SLAPD_DEBUG=269 ./run test050 (I've done s/exec/exec $GRIND/ on last 'exec' in servers/slapd/slapd).
Seems to usually be in slapd.2.log. It ends with:
conn=0 op=1 SEARCH RESULT tag=101 err=0 nentries=4 text= daemon: activity on 1 descriptor daemon: activity on: 9r daemon: read active on 9 daemon: epoll: listen=7 active_threads=0 tvp=NULL connection_get(9) connection_get(9): got connid=0 connection_read(9): checking for input on id=0 ber_get_next daemon: activity on 1 descriptor daemon: activity on: daemon: hangup on 9 connection_get(9) ber_get_next: tag 0x30 len 5 contents: ber_get_next ber_get_next on fd 9 failed errno=0 (Success) connection_read(9): input error=-2 id=0, closing. connection_closing: readying conn=0 sd=9 for close connection_close: deferring conn=0 sd=9 conn=0 op=2 do_unbind conn=0 op=2 UNBIND connection_get(9): got connid=0 connection_close: deferring conn=0 sd=9 daemon: epoll: listen=7 active_threads=0 tvp=NULL connection_resched: attempting closing conn=0 sd=9 connection_close: conn=0 sd=9 daemon: removing 9 daemon: activity on 2 descriptors daemon: activity on: conn=0 fd=9 closed daemon: hangup on 9 lt-slapd: daemon.c:181: SLAP_EPOLL_SOCK_SET: Assertion `ix >= 0' failed.
$ grep '^conn=' slapd.2.log
conn=0 fd=9 ACCEPT from IP=127.0.0.1:46366 (IP=127.0.0.1:9012)
conn=0 op=0 BIND dn="cn=config" method=128
conn=0 op=0 BIND dn="cn=config" mech=SIMPLE ssf=0
conn=0 op=0 RESULT tag=97 err=0 text=
conn=0 op=1 do_search
conn=0 op=1 SRCH base="cn=config" scope=2 deref=0 filter="(objectClass=*)"
conn=0 op=1 SRCH attr=* +
conn=0 op=1 SEARCH RESULT tag=101 err=0 nentries=4 text=
conn=0 op=2 do_unbind
conn=0 op=2 UNBIND
conn=0 fd=9 closed
I wrote:
Howard Chu writes:
That's interesting. That would require receiving a Hangup event on a descriptor we've already removed from epoll's event list. How can that be?
OS bug? Race condition? Valgrind bug, since so far it only happens under Valgrind?
Happened without Valgrind too, though it took 55 runs to trigger. Same backtrace except this time it's not on the sole connection:
sd=11 which is conn=4. *slap_daemon.sd_index@128 = {-1, -1, -1, -1, 0, -1, -1, 1, -1, 2, 3, -1, -1, -1, -1, -1, -1, -1, 4, -1 <repeats 109 times>}.
Activity on another connection (conn=5) between the conn=4 UNBIND and the crash. Testrun dir in http://folk.uio.no/hbf/testrun.2.tbz.
It has not happened so far on 2.4.13.
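For illustration only, here is a hedged sketch of one defensive variant of the function quoted earlier, assuming the problem really is a hangup event delivered in the same epoll_wait() batch in which the descriptor was closed and removed. This is just a sketch against the code as shown above, not the change that actually went into the tree:

static void SLAP_EPOLL_SOCK_SET(int s, long mode) {
    long ix = SLAP_EPOLL_SOCK_IX(s);
    if ( ix < 0 ) {
        /* fd was already removed from the epoll set (e.g. a late
         * hangup event); nothing left to update, so ignore it
         * instead of indexing sd_epolls[] with -1. */
        return;
    }
    long ev = (slap_daemon.sd_epolls[ix]).events;
    if ( (ev & (mode)) != (mode) ) {
        SLAP_EPOLL_SOCK_EV(s) |= (mode);
        epoll_ctl( slap_daemon.sd_epfd, EPOLL_CTL_MOD, (s),
            &SLAP_EPOLL_SOCK_EP(s) );
    }
}

Of course that only hides the symptom; the interesting question remains how the late hangup gets delivered in the first place.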
--On February 9, 2009 8:11:20 AM -0800 Quanah Gibson-Mount quanah@zimbra.com wrote:
Please test current RE24 CVS. In particular, if people can test:
(a) Building without TLS & without sasl
(b) Building without TLS
(c) Building without sasl
All these configurations (and the normal full configuration) run through the tests without error on OS X 10.5.6/Intel.
jens