Several more fixes completed, please test.
Thanks!
Still waiting on confirmation from someone with gnutls. ;)
--Quanah
--On February 9, 2009 8:11:20 AM -0800 Quanah Gibson-Mount quanah@zimbra.com wrote:
Please test current RE24 CVS. In particular, if people can test:
(a) Building without TLS & without sasl
(b) Building without TLS
(c) Building without sasl
(d) If at least one person can build against GnuTLS instead of OpenSSL
Much appreciated. :)
Thanks,
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
Zimbra :: the leader in open source messaging and collaboration
On Tue, 10 Feb 2009, Quanah Gibson-Mount wrote:
Several more fixes completed, please test.
Fails test050.
<= str2entry NULL (smr_normalize contextCSN 21)
send_ldap_result: conn=-1 op=0 p=0
send_ldap_result: err=80 matched="" text="internal error (cannot parse some entry file)"
slapd destroy: freeing system resources.
slapd stopped.
connections_destroy: nothing to destroy.
testrun dir: https://www.nbcs.rutgers.edu/~richton/test050.fail.RE24-20090210.tar.bz2
Still waiting on confirmation from someone with gnutls. ;)
During the original call for testing, I made one run through without SASL or OpenSSL. I've switched back to a full-blown configuration at this point.
Aaron Richton writes:
<= str2entry NULL (smr_normalize contextCSN 21)
send_ldap_result: conn=-1 op=0 p=0
send_ldap_result: err=80 matched="" text="internal error (cannot parse some entry file)"
The 2nd contextCSN in that entry is garbage:
dn: cn=config
objectClass: olcGlobal
cn: config
structuralObjectClass: olcGlobal
creatorsName: cn=config
createTimestamp: 20090211053745Z
entryUUID: dade8f04-8c49-102d-935b-1f79305f5c76
olcServerID: 1 ldap://localhost:9011/
olcServerID: 2 ldap://localhost:9012/
olcServerID: 3 ldap://localhost:9013/
entryCSN: 20090211053747.167910Z#000000#001#000000
modifiersName: cn=config
modifyTimestamp: 20090211053747Z
contextCSN: 20090211053750.548808Z#000000#001#000000
contextCSN:: fH/zsAAkIZgAJAtgAEP0QABD8nAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
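For reference, the first contextCSN value above shows the expected shape: a 14-digit timestamp, '.', six fractional digits, 'Z', then three '#'-separated fields (roughly: change count, server ID, modification number). Here is a minimal standalone sketch of that syntax check -- just an illustration of the format, not OpenLDAP's own validator:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Rough standalone check, not OpenLDAP code: a CSN such as
 * 20090211053750.548808Z#000000#001#000000 is 40 characters long --
 * 14 timestamp digits, '.', 6 fractional digits, 'Z', then three
 * '#'-separated hex fields of width 6, 3 and 6. */
static int csn_looks_valid( const char *s )
{
    int i;

    if ( strlen( s ) != 40 ) return 0;
    for ( i = 0; i < 14; i++ )      /* timestamp */
        if ( !isdigit( (unsigned char) s[i] ) ) return 0;
    if ( s[14] != '.' ) return 0;
    for ( i = 15; i < 21; i++ )     /* fraction */
        if ( !isdigit( (unsigned char) s[i] ) ) return 0;
    if ( s[21] != 'Z' || s[22] != '#' ) return 0;
    for ( i = 23; i < 29; i++ )     /* change count */
        if ( !isxdigit( (unsigned char) s[i] ) ) return 0;
    if ( s[29] != '#' ) return 0;
    for ( i = 30; i < 33; i++ )     /* server ID */
        if ( !isxdigit( (unsigned char) s[i] ) ) return 0;
    if ( s[33] != '#' ) return 0;
    for ( i = 34; i < 40; i++ )     /* modification number */
        if ( !isxdigit( (unsigned char) s[i] ) ) return 0;
    return 1;
}

int main( void )
{
    /* the first value above passes; the base64 blob clearly cannot */
    printf( "%d\n", csn_looks_valid(
        "20090211053750.548808Z#000000#001#000000" ) );
    return 0;
}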
Hallvard B Furuseth wrote:
Aaron Richton writes:
<= str2entry NULL (smr_normalize contextCSN 21)
send_ldap_result: conn=-1 op=0 p=0
send_ldap_result: err=80 matched="" text="internal error (cannot parse some entry file)"
The 2nd contextCSN in that entry is garbage:
Looking at the log file, it appears to have been corrupted during the checkpoint on the previous shutdown.
I've added some checks/asserts to syncrepl.c and syncprov.c in HEAD. Can you please incorporate those asserts into your build and try to reproduce this? So far I've run test050 50 times and not seen any problem.
dn: cn=config
objectClass: olcGlobal
cn: config
structuralObjectClass: olcGlobal
creatorsName: cn=config
createTimestamp: 20090211053745Z
entryUUID: dade8f04-8c49-102d-935b-1f79305f5c76
olcServerID: 1 ldap://localhost:9011/
olcServerID: 2 ldap://localhost:9012/
olcServerID: 3 ldap://localhost:9013/
entryCSN: 20090211053747.167910Z#000000#001#000000
modifiersName: cn=config
modifyTimestamp: 20090211053747Z
contextCSN: 20090211053750.548808Z#000000#001#000000
contextCSN:: fH/zsAAkIZgAJAtgAEP0QABD8nAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
On Wed, 11 Feb 2009, Howard Chu wrote:
I've added some checks/asserts to syncrepl.c and syncprov.c in HEAD. Can you please incorporate those asserts into your build and try to reproduce this? So far I've run test050 50 times and not seen any problem.
RE24 changed to:
pkg/ldap/servers/slapd/syncrepl.c,v 1.444 2009/02/11 21:00:00
pkg/ldap/servers/slapd/overlays/syncprov.c,v 1.257 2009/02/11 21:00:00
from HEAD. No asserts, and it looks like the same issue (although I could just be reading it wrong pre-coffee). This happened once during an overnight infinite loop (so figure maybe once in ~400-500 runs). testrun:
https://www.nbcs.rutgers.edu/~richton/test050failed.2009022322.tar.bz2
Aaron Richton wrote:
On Thu, 12 Feb 2009, Aaron Richton wrote:
from HEAD. No asserts, and it looks like the same issue (although I could
Uh, oops. I'll recompile with CHECK_CSN now. Right.
I've run about 600 iterations each on two separate servers all day today, no asserts. Tell me again what your system environment is?
On Thu, 12 Feb 2009, Howard Chu wrote:
I've run about 600 iterations each on two separate servers all day today, no asserts. Tell me again what your system environment is?
This is Solaris 9 on sparc; db4 4.7.25 patched.
Running all night didn't fail (assertion or not) a single time. If I only caught it once, I'd blame cosmic rays. Twice kind of bothers me, though.
Aaron Richton wrote:
On Thu, 12 Feb 2009, Howard Chu wrote:
I've run about 600 iterations each on two separate servers all day today, no asserts. Tell me again what your system environment is?
This is Solaris 9 on sparc; db4 4.7.25 patched.
Running all night didn't fail (assertion or not) a single time. If I only caught it once, I'd blame cosmic rays. Twice kind of bothers me, though.
Agreed. But at this point, 2.4.14 is such an improvement over .13 that I'd rather get it out instead of delay it further. We can revisit this for .15 when we get some better traces of the problem.
On Fri, 13 Feb 2009, Howard Chu wrote:
Agreed. But at this point, 2.4.14 is such an improvement over .13 that I'd rather get it out instead of delay it further. We can revisit this for .15 when we get some better traces of the problem.
So I was about to write my "I agree" e-mail, but I caught:
1347 assert( !syn->ssyn_validate( syn, si->si_ctxcsn+i ));
see https://www.nbcs.rutgers.edu/~richton/test050fail.200902142000 for backtrace.
Aaron Richton wrote:
On Fri, 13 Feb 2009, Howard Chu wrote:
Agreed. But at this point, 2.4.14 is such an improvement over .13 that I'd rather get it out instead of delay it further. We can revisit this for .15 when we get some better traces of the problem.
So I was about to write my "I agree" e-mail, but I caught:
1347 assert( !syn->ssyn_validate( syn, si->si_ctxcsn+i ));
see https://www.nbcs.rutgers.edu/~richton/test050fail.200902142000 for backtrace.
Added a few more asserts to HEAD syncprov.c; let's see if we can narrow this down any further. There aren't any other places that set si->si_ctxcsn, so if the closing assert triggers but not the earlier ones, then we probably have an overwrite problem. If the other asserts trigger instead, then we're being fed bad data from something further upstream.
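To make the reasoning concrete, here is a toy sketch of the bracketing idea (the names are invented for the example; this is not the actual syncprov.c/syncrepl.c diff): assert validity right where the value is stored, and again right before it is used. Which assert fires tells you whether the writer received bad data or something later overwrote it.

#include <assert.h>
#include <string.h>

/* Toy illustration only; struct and function names are made up. */
struct toy_state {
    char ctxcsn[64];
};

static int looks_valid( const char *csn )
{
    /* stand-in for a real CSN syntax check */
    return strlen( csn ) == 40 && csn[21] == 'Z';
}

/* The only writer: if this assert fires, the bad value came from upstream. */
static void set_ctxcsn( struct toy_state *st, const char *src )
{
    strncpy( st->ctxcsn, src, sizeof( st->ctxcsn ) - 1 );
    st->ctxcsn[sizeof( st->ctxcsn ) - 1] = '\0';
    assert( looks_valid( st->ctxcsn ) );
}

/* The closing check: if only this assert fires, something between the
 * writer and this point overwrote the stored value. */
static void use_ctxcsn( const struct toy_state *st )
{
    assert( looks_valid( st->ctxcsn ) );
    /* ... consume st->ctxcsn ... */
}

int main( void )
{
    struct toy_state st;
    set_ctxcsn( &st, "20090211053750.548808Z#000000#001#000000" );
    use_ctxcsn( &st );
    return 0;
}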
Aaron Richton wrote:
On Fri, 13 Feb 2009, Howard Chu wrote:
Agreed. But at this point, 2.4.14 is such an improvement over .13 that I'd rather get it out instead of delay it further. We can revisit this for .15 when we get some better traces of the problem.
So I was about to write my "I agree" e-mail, but I caught:
1347 assert( !syn->ssyn_validate( syn, si->si_ctxcsn+i ));
see https://www.nbcs.rutgers.edu/~richton/test050fail.200902142000 for backtrace.
This should now be fixed in HEAD.
On Sun, 15 Feb 2009, Howard Chu wrote:
This should now be fixed in HEAD.
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Aaron Richton wrote:
On Sun, 15 Feb 2009, Howard Chu wrote:
This should now be fixed in HEAD.
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
I've got two servers set up running 1000 iterations each, will see how things go tomorrow.
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
And in slapd.1.log
...
end_ldap_response: msgid=436 tag=101 err=0
ber_flush2: 15 bytes to sd 24
<= hdb_dn2id_add 0x94: -30994
hdb_add: dn2id_add failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
delay = 668, num_retries = 1
<= hdb_dn2id_delete 0x90: 0
=> index_entry_del( 144, "cn=James A Jones 3,ou=Alumni Association,ou=People,dc=example,dc=com" )
=> key_change(DELETE,90)
<= bdb_index_read: failed (-30988)
...
-Dieter
Dieter Kluenter wrote:
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
Which is pretty squarely inside BerkeleyDB. But nobody else has reported any such problem yet. Seems you'll have to check your BDB build.
And in slapd.1.log
...
end_ldap_response: msgid=436 tag=101 err=0
ber_flush2: 15 bytes to sd 24
<= hdb_dn2id_add 0x94: -30994
hdb_add: dn2id_add failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
delay = 668, num_retries = 1
<= hdb_dn2id_delete 0x90: 0
=> index_entry_del( 144, "cn=James A Jones 3,ou=Alumni Association,ou=People,dc=example,dc=com" )
=> key_change(DELETE,90)
<= bdb_index_read: failed (-30988)
...
-Dieter
Dieter Kluenter wrote:
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
It appears you need Oracle's patch #3.
http://www.oracle.com/technology/products/berkeley-db/db/update/4.7.25/patch...
Howard Chu hyc@symas.com writes:
Dieter Kluenter wrote:
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
It appears you need Oracle's patch #3.
http://www.oracle.com/technology/products/berkeley-db/db/update/4.7.25/patch...
This patch had already been applied, as I mentioned in http://www.openldap.org/lists/openldap-devel/200902/msg00031.html; still, one out of five test runs fails.
-Dieter
--On Monday, February 16, 2009 8:19 PM +0100 Dieter Kluenter dieter@dkluenter.de wrote:
This patch had already been applied, as I mentioned in http://www.openldap.org/lists/openldap-devel/200902/msg00031.html; still, one out of five test runs fails.
I'd suggest re-building BDB anyway to be sure it applied, and then testing. You are the only one seeing this, and many people have been testing RE24/2.4.14.
--Quanah
Quanah Gibson-Mount quanah@zimbra.com writes:
--On Monday, February 16, 2009 8:19 PM +0100 Dieter Kluenter dieter@dkluenter.de wrote:
This patch had already been applied, as I mentioned in http://www.openldap.org/lists/openldap-devel/200902/msg00031.html; still, one out of five test runs fails.
I'd suggest re-building BDB anyway to be sure it applied, and then testing. You are the only one seeing this, and many people have been testing RE24/2.4.14.
And how many of those reported successes are based on db-4.7 and amd64, with a large number of iterations of test039? I have done some 200 iterations now, with a success/failure ratio of 5:1.
-Dieter
--On Monday, February 16, 2009 9:57 PM +0100 Dieter Kluenter dieter@dkluenter.de wrote:
And how many of those reported successes are based on db-4.7 and amd64, with a large number of iterations of test039? I have done some 200 iterations now, with a success/failure ratio of 5:1.
Quite a few are using BDB 4.7 these days, although I don't know what CPU. My test box is BDB 4.7.25 with all 3 patches from Oracle (Do not use the one shipped with OL 2.4.12-2.4.13), and has 64-bit AMD CPUs.
What slapd backend are you using to get the 5:1 ratio? back-bdb? back-hdb?
--Quanah
Dieter Kluenter wrote:
Howard Chu hyc@symas.com writes:
Dieter Kluenter wrote:
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
It appears you need Oracle's patch #3.
http://www.oracle.com/technology/products/berkeley-db/db/update/4.7.25/patch...
This patch had already been applied, as I mentioned in http://www.openldap.org/lists/openldap-devel/200902/msg00031.html; still, one out of five test runs fails.
The email you reference above only says you have patch 4.7.25.1. As I said, you need patch #3. So, unless you made a mistake in your previous email, I see no evidence that you have the necessary patch.
--On Monday, February 16, 2009 3:35 PM -0800 Howard Chu hyc@symas.com wrote:
The email you reference above only says you have patch 4.7.25.1. As I said, you need patch #3. So, unless you made a mistake in your previous email, I see no evidence that you have the necessary patch.
I finally got test039 to crash for me, too. Unfortunately, I was on a different system than I usually use, and it didn't have core files enabled. I'm attempting to re-trigger the event.
--Quanah
Howard Chu hyc@symas.com writes:
Dieter Kluenter wrote:
Howard Chu hyc@symas.com writes:
Dieter Kluenter wrote:
Aaron Richton richton@nbcs.rutgers.edu writes:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
I still get the already reported core dump on test039.
Core was generated by `/home/dieter/openldap/servers/slapd/.libs/lt-slapd -s0 -f /home/dieter/openldap'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b77bc3da12d in __lock_detect () from /usr/local/BerkeleyDB.4.7/lib/libdb-4.7.so
Current language: auto; currently asm
It appears you need Oracle's patch #3.
http://www.oracle.com/technology/products/berkeley-db/db/update/4.7.25/patch...
This patch had already been applied, as I mentioned in http://www.openldap.org/lists/openldap-devel/200902/msg00031.html; still, one out of five test runs fails.
The email you reference above only says you have patch 4.7.25.1. As I said, you need patch #3. So, unless you made a mistake in your previous email, I see no evidence that you have the necessary patch.
mea culpa, I wasn't aware that there were three patches already.
-Dieter
Aaron Richton wrote:
On Sun, 15 Feb 2009, Aaron Richton wrote:
I'll start a RE24 test050 loop overnight with syncrepl.c.1.444 syncprov.c.1.261 ctxcsn.c.1.53
We know that this is a pretty rare issue; hopefully some others on the list will join in so we can try and get the run count high...
Ran fine overnight here.
Same here, on a server with BDB 4.6.21 and a server with BDB 4.7.25.
With test050 and defined(HAVE_EPOLL), valgrind sometimes reports invalid reads/writes in slapd_daemon_task(). Maybe that clobbers some memory (like the garbage contextCSN mentioned previously).
Turning SLAP_EPOLL_SOCK_SET(s, mode) into a function shows that s == 9 but SLAP_EPOLL_SOCK_IX(s) == -1, so slap_daemon.sd_epolls[] gets a negative index.
I don't know where to go from there. Maybe hangup (below) doesn't clean up after itself.
static void SLAP_EPOLL_SOCK_SET(int s, long mode) {
    long ix = SLAP_EPOLL_SOCK_IX(s);
    assert(ix >= 0);
    long ev = (slap_daemon.sd_epolls[ix]).events;
    if ( (ev & (mode)) != (mode) ) {
        SLAP_EPOLL_SOCK_EV(s) |= (mode);
        epoll_ctl( slap_daemon.sd_epfd, EPOLL_CTL_MOD, (s),
            &SLAP_EPOLL_SOCK_EP(s) );
    }
}
(gdb) thread 3
[Switching to thread 3 (process 3915)]
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
181         assert(ix >= 0);
(gdb) backtrace
#0 0x000000350e230215 in raise () from /lib64/libc.so.6
#1 0x000000350e231cc0 in abort () from /lib64/libc.so.6
#2 0x000000350e229696 in __assert_fail () from /lib64/libc.so.6
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
#4 0x000000000041ef40 in slapd_daemon_task (ptr=0x0) at daemon.c:2557
#5 0x000000350ee06367 in start_thread () from /lib64/libpthread.so.0
(gdb) frame 3
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
181         assert(ix >= 0);

(gdb) print ix
$9 = -1
(gdb) print *slap_daemon.sd_index@128
$10 = {-1, -1, -1, -1, 0, -1, -1, 1, -1 <repeats 120 times>}

(gdb) up
#4 0x000000000041ef40 in slapd_daemon_task (ptr=0x0) at daemon.c:2557
2557            SLAP_EPOLL_SOCK_SET( fd, EPOLLET );
(gdb) list
2552            Debug( LDAP_DEBUG_CONNS,
2553                "daemon: hangup on %d\n", fd, 0, 0 );
2554    #ifdef HAVE_EPOLL
2555            /* Don't keep reporting the hangup
2556             */
2557            SLAP_EPOLL_SOCK_SET( fd, EPOLLET );
2558    #endif
Hallvard B Furuseth wrote:
With test050 and defined(HAVE_EPOLL), valgrind sometimes reports invalid reads/writes in slapd_daemon_task(). Maybe that clobbers some memory (like the garbage contextCSN mentioned previously).
Turning SLAP_EPOLL_SOCK_SET(s, mode) into a function shows that s == 9 but SLAP_EPOLL_SOCK_IX(s) == -1, so slap_daemon.sd_epolls[] gets a negative index.
I don't know where to go from there. Maybe hangup (below) doesn't clean up after itself.
That's interesting. That would require receiving a Hangup event on a descriptor we've already removed from epoll's event list. How can that be? Have you seen this with CONNS debug messages enabled?
static void SLAP_EPOLL_SOCK_SET(int s, long mode) {
    long ix = SLAP_EPOLL_SOCK_IX(s);
    assert(ix >= 0);
    long ev = (slap_daemon.sd_epolls[ix]).events;
    if ( (ev & (mode)) != (mode) ) {
        SLAP_EPOLL_SOCK_EV(s) |= (mode);
        epoll_ctl( slap_daemon.sd_epfd, EPOLL_CTL_MOD, (s),
            &SLAP_EPOLL_SOCK_EP(s) );
    }
}
(gdb) thread 3
[Switching to thread 3 (process 3915)]
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
181         assert(ix >= 0);
(gdb) backtrace
#0 0x000000350e230215 in raise () from /lib64/libc.so.6
#1 0x000000350e231cc0 in abort () from /lib64/libc.so.6
#2 0x000000350e229696 in __assert_fail () from /lib64/libc.so.6
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
#4 0x000000000041ef40 in slapd_daemon_task (ptr=0x0) at daemon.c:2557
#5 0x000000350ee06367 in start_thread () from /lib64/libpthread.so.0
(gdb) frame 3
#3 0x0000000000419b9c in SLAP_EPOLL_SOCK_SET (s=9, mode=-2147483648) at daemon.c:181
181         assert(ix >= 0);

(gdb) print ix
$9 = -1
(gdb) print *slap_daemon.sd_index@128
$10 = {-1, -1, -1, -1, 0, -1, -1, 1, -1 <repeats 120 times>}

(gdb) up
#4 0x000000000041ef40 in slapd_daemon_task (ptr=0x0) at daemon.c:2557
2557            SLAP_EPOLL_SOCK_SET( fd, EPOLLET );
(gdb) list
2552            Debug( LDAP_DEBUG_CONNS,
2553                "daemon: hangup on %d\n", fd, 0, 0 );
2554    #ifdef HAVE_EPOLL
2555            /* Don't keep reporting the hangup
2556             */
2557            SLAP_EPOLL_SOCK_SET( fd, EPOLLET );
2558    #endif
Howard Chu writes:
That's interesting. That would require receiving a Hangup event on a descriptor we've already removed from epoll's event list. How can that be?
OS bug? Race condition? Valgrind bug, since so far it only happens under Valgrind?
Have you seen this with CONNS debug messages enabled?
Full testrun dir in http://folk.uio.no/hbf/testrun.1.tbz from GRIND=valgrind SLAPD_DEBUG=269 ./run test050 (I've done s/exec/exec $GRIND/ on last 'exec' in servers/slapd/slapd).
Seems to usually be in slapd.2.log. It ends with:
conn=0 op=1 SEARCH RESULT tag=101 err=0 nentries=4 text= daemon: activity on 1 descriptor daemon: activity on: 9r daemon: read active on 9 daemon: epoll: listen=7 active_threads=0 tvp=NULL connection_get(9) connection_get(9): got connid=0 connection_read(9): checking for input on id=0 ber_get_next daemon: activity on 1 descriptor daemon: activity on: daemon: hangup on 9 connection_get(9) ber_get_next: tag 0x30 len 5 contents: ber_get_next ber_get_next on fd 9 failed errno=0 (Success) connection_read(9): input error=-2 id=0, closing. connection_closing: readying conn=0 sd=9 for close connection_close: deferring conn=0 sd=9 conn=0 op=2 do_unbind conn=0 op=2 UNBIND connection_get(9): got connid=0 connection_close: deferring conn=0 sd=9 daemon: epoll: listen=7 active_threads=0 tvp=NULL connection_resched: attempting closing conn=0 sd=9 connection_close: conn=0 sd=9 daemon: removing 9 daemon: activity on 2 descriptors daemon: activity on: conn=0 fd=9 closed daemon: hangup on 9 lt-slapd: daemon.c:181: SLAP_EPOLL_SOCK_SET: Assertion `ix >= 0' failed.
$ grep '^conn=' slapd.2.log
conn=0 fd=9 ACCEPT from IP=127.0.0.1:46366 (IP=127.0.0.1:9012)
conn=0 op=0 BIND dn="cn=config" method=128
conn=0 op=0 BIND dn="cn=config" mech=SIMPLE ssf=0
conn=0 op=0 RESULT tag=97 err=0 text=
conn=0 op=1 do_search
conn=0 op=1 SRCH base="cn=config" scope=2 deref=0 filter="(objectClass=*)"
conn=0 op=1 SRCH attr=* +
conn=0 op=1 SEARCH RESULT tag=101 err=0 nentries=4 text=
conn=0 op=2 do_unbind
conn=0 op=2 UNBIND
conn=0 fd=9 closed
I wrote:
Howard Chu writes:
That's interesting. That would require receiving a Hangup event on a descriptor we've already removed from epoll's event list. How can that be?
OS bug? Race condition? Valgrind bug, since so far it only happens under Valgrind?
Happened without Valgrind too, though it took 55 runs to trigger. Same backtrace except this time it's not on the sole connection:
sd=11 which is conn=4. *slap_daemon.sd_index@128 = {-1, -1, -1, -1, 0, -1, -1, 1, -1, 2, 3, -1, -1, -1, -1, -1, -1, -1, 4, -1 <repeats 109 times>}.
Activity on another connection (conn=5) between the conn=4 UNBIND and the crash. Testrun dir in http://folk.uio.no/hbf/testrun.2.tbz.
It has not happened so far on 2.4.13.
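For illustration only, here is a hedged sketch of one defensive variant of the function quoted earlier, assuming the problem really is a hangup event delivered in the same epoll_wait() batch in which the descriptor was closed and removed. This is just a sketch against the code as shown above, not the change that actually went into the tree:

static void SLAP_EPOLL_SOCK_SET(int s, long mode) {
    long ix = SLAP_EPOLL_SOCK_IX(s);
    if ( ix < 0 ) {
        /* fd was already removed from the epoll set (e.g. a late
         * hangup event); nothing left to update, so ignore it
         * instead of indexing sd_epolls[] with -1. */
        return;
    }
    long ev = (slap_daemon.sd_epolls[ix]).events;
    if ( (ev & (mode)) != (mode) ) {
        SLAP_EPOLL_SOCK_EV(s) |= (mode);
        epoll_ctl( slap_daemon.sd_epfd, EPOLL_CTL_MOD, (s),
            &SLAP_EPOLL_SOCK_EP(s) );
    }
}

Of course that only hides the symptom; the interesting question remains how the late hangup gets delivered in the first place.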
--On February 9, 2009 8:11:20 AM -0800 Quanah Gibson-Mount quanah@zimbra.com wrote:
Please test current RE24 CVS. In particular, if people can test:
(a) Building without TLS & without sasl
(b) Building without TLS
(c) Building without sasl
All these configurations (and the normal full configuration) run through the tests without error on OS X 10.5.6/Intel.
jens