RE24 connection code reworking

List overview All Threads
Download

newer

older

RE24 connection code round 2

autoconf

Quanah Gibson-Mount

23 Jan 2009 23 Jan '09

9:43 a.m.

RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it does for both Howard and I, but Michael Ströder has reported failures in his OpenSuSE builds that we can't reproduce.

Thanks, Quanah

Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration

Show replies by date

Jonathan Clarke

23 Jan 23 Jan

5:05 p.m.

Quanah Gibson-Mount wrote:

...

RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it does for both Howard and I, but Michael Ströder has reported failures in his OpenSuSE builds that we can't reproduce.

All tests passed OK on RE24 from an hour ago here on Ubuntu 8.10 (i386).

Regards, Jonathan

-- Jonathan Clarke -- LinID - Open Source Identity Management --------------------------------------------------------------- Linagora 27 rue de Berri, 75008 Paris Tel: 01 58 18 68 28 / 06 99 60 03 10 --------------------------------------------------------------- Ldap Synchronization Connector (LSC) - http://lsc-project.org/ ---------------------------------------------------------------

Pierangelo Masarati

24 Jan 24 Jan

2:10 a.m.

Quanah Gibson-Mount wrote:

...

RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it does for both Howard and I, but Michael Ströder has reported failures in his OpenSuSE builds that we can't reproduce.

I found two apparently identical core dumps (from HEAD). The second one was from test045 with hdb (all tests went fine with bdb, apparently). The first one I don't know, as it occurred 12 minutes earlier, the test didn't detect it and I didn't save the log info from the runs. The backtrace looks identical:

test045:

#0 0x0049f402 in __kernel_vsyscall () (gdb) bt #0 0x0049f402 in __kernel_vsyscall () #1 0x00c9ad20 in raise () from /lib/libc.so.6 #2 0x00c9c631 in abort () from /lib/libc.so.6 #3 0x00c9416b in __assert_fail () from /lib/libc.so.6 #4 0x08219421 in ldap_pvt_runqueue_remove (rq=0x83521c0, entry=0x9635328) at ../../../ldap-2.4-src/libraries/libldap_r/rq.c:92 #5 0x080f567a in do_syncrepl (ctx=0xb63e11d0, arg=0x9635328) at ../../../ldap-2.4-src/servers/slapd/syncrepl.c:1402 #6 0x08085a71 in connection_read_thread (ctx=0xb63e11d0, argv=0xa) at ../../../ldap-2.4-src/servers/slapd/connection.c:1225 #7 0x082188a5 in ldap_int_thread_pool_wrapper (xpool=0x9604cd8) at ../../../ldap-2.4-src/libraries/libldap_r/tpool.c:663 #8 0x00deb46b in start_thread () from /lib/libpthread.so.0 #9 0x00d42dbe in clone () from /lib/libc.so.6

unknown test:

#0 0x006d3402 in __kernel_vsyscall () (gdb) bt #0 0x006d3402 in __kernel_vsyscall () #1 0x00c9ad20 in raise () from /lib/libc.so.6 #2 0x00c9c631 in abort () from /lib/libc.so.6 #3 0x00c9416b in __assert_fail () from /lib/libc.so.6 #4 0x08219421 in ldap_pvt_runqueue_remove (rq=0x83521c0, entry=0x963b050) at ../../../ldap-2.4-src/libraries/libldap_r/rq.c:92 #5 0x080f567a in do_syncrepl (ctx=0xb355d1d0, arg=0x963b050) at ../../../ldap-2.4-src/servers/slapd/syncrepl.c:1402 #6 0x08085a71 in connection_read_thread (ctx=0xb355d1d0, argv=0xc) at ../../../ldap-2.4-src/servers/slapd/connection.c:1225 #7 0x082188a5 in ldap_int_thread_pool_wrapper (xpool=0x960acd8) at ../../../ldap-2.4-src/libraries/libldap_r/tpool.c:663 #8 0x00deb46b in start_thread () from /lib/libpthread.so.0 #9 0x00d42dbe in clone () from /lib/libc.so.6

I'm currently using CentOS 5.2 kernel: 2.6.18-92.1.10.el5 Oracle DB: 4.7.25

The db library was built quite some time ago; I do not recall how many patches I applied, though I recall applying some. For this reason, I'm going to rebuild it right now.

Ing. Pierangelo Masarati OpenLDAP Core Team

SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------

Pierangelo Masarati

2:16 a.m.

Pierangelo Masarati wrote:

...

Oracle DB: 4.7.25

The db library was built quite some time ago; I do not recall how many patches I applied, though I recall applying some. For this reason, I'm going to rebuild it right now.

Based on the fact that in the vicinity of my working directory I could only locate the first patch for 4.7.25, I infer I was missing the other two. I'm now rebuilding, will report ASAP.

Ing. Pierangelo Masarati OpenLDAP Core Team

Pierangelo Masarati

2:48 a.m.

Pierangelo Masarati wrote:

...

Pierangelo Masarati wrote:

...
Oracle DB: 4.7.25

The db library was built quite some time ago; I do not recall how many patches I applied, though I recall applying some. For this reason, I'm going to rebuild it right now.

Based on the fact that in the vicinity of my working directory I could only locate the first patch for 4.7.25, I infer I was missing the other two. I'm now rebuilding, will report ASAP.

Well, I didn't expect too much improvement from rebuilding the Berkeley DB, as the issue seemed to be unrelated, but I might have been wrong, as no issue appeared in HEAD after re-running the test quite a few times. I'll keep trying, then move to re24 and report.

Ing. Pierangelo Masarati OpenLDAP Core Team

Pierangelo Masarati

4:34 a.m.

Pierangelo Masarati wrote:

...

Pierangelo Masarati wrote:

...
Pierangelo Masarati wrote:

...
Oracle DB: 4.7.25

The db library was built quite some time ago; I do not recall how many patches I applied, though I recall applying some. For this reason, I'm going to rebuild it right now.

Based on the fact that in the vicinity of my working directory I could only locate the first patch for 4.7.25, I infer I was missing the other two. I'm now rebuilding, will report ASAP.

Well, I didn't expect too much improvement from rebuilding the Berkeley DB, as the issue seemed to be unrelated, but I might have been wrong, as no issue appeared in HEAD after re-running the test quite a few times. I'll keep trying, then move to re24 and report.

I ran 30 times test045 with HEAD and got no failures. Then re24 failed after 44 runs with the backtrace below (identical to the previous ones).

#0 0x0090f402 in __kernel_vsyscall () (gdb) bt #0 0x0090f402 in __kernel_vsyscall () #1 0x00c9ad20 in raise () from /lib/libc.so.6 #2 0x00c9c631 in abort () from /lib/libc.so.6 #3 0x00c9416b in __assert_fail () from /lib/libc.so.6 #4 0x08219421 in ldap_pvt_runqueue_remove (rq=0x83521c0, entry=0x9214328) at ../../../ldap-2.4-src/libraries/libldap_r/rq.c:92 #5 0x080f567a in do_syncrepl (ctx=0xb4bfc1d0, arg=0x9214328) at ../../../ldap-2.4-src/servers/slapd/syncrepl.c:1402 #6 0x08085a71 in connection_read_thread (ctx=0xb4bfc1d0, argv=0xa) at ../../../ldap-2.4-src/servers/slapd/connection.c:1225 #7 0x082188a5 in ldap_int_thread_pool_wrapper (xpool=0x91e3cd8) at ../../../ldap-2.4-src/libraries/libldap_r/tpool.c:663 #8 0x00deb46b in start_thread () from /lib/libpthread.so.0 #9 0x00d42dbe in clone () from /lib/libc.so.6

So I fear the problem is still there.

Ing. Pierangelo Masarati OpenLDAP Core Team

Pierangelo Masarati

4:43 a.m.

Pierangelo Masarati wrote:

...

I ran 30 times test045 with HEAD and got no failures. Then re24 failed after 44 runs with the backtrace below (identical to the previous ones).

#0 0x0090f402 in __kernel_vsyscall () (gdb) bt #0 0x0090f402 in __kernel_vsyscall () #1 0x00c9ad20 in raise () from /lib/libc.so.6 #2 0x00c9c631 in abort () from /lib/libc.so.6 #3 0x00c9416b in __assert_fail () from /lib/libc.so.6 #4 0x08219421 in ldap_pvt_runqueue_remove (rq=0x83521c0, entry=0x9214328) at ../../../ldap-2.4-src/libraries/libldap_r/rq.c:92 #5 0x080f567a in do_syncrepl (ctx=0xb4bfc1d0, arg=0x9214328) at ../../../ldap-2.4-src/servers/slapd/syncrepl.c:1402 #6 0x08085a71 in connection_read_thread (ctx=0xb4bfc1d0, argv=0xa) at ../../../ldap-2.4-src/servers/slapd/connection.c:1225 #7 0x082188a5 in ldap_int_thread_pool_wrapper (xpool=0x91e3cd8) at ../../../ldap-2.4-src/libraries/libldap_r/tpool.c:663 #8 0x00deb46b in start_thread () from /lib/libpthread.so.0 #9 0x00d42dbe in clone () from /lib/libc.so.6

So I fear the problem is still there.

As far as I understand, that assertion seems to be too strict. Apparently, someone else already removed the task. Perhaps, all calls to runqueue should be preceded by a check about the existence of the task itself. An excerpt of that slapd instance's logs indicates:

slap_queue_csn: queing 0x92a30c8 20090124113556.050820Z#000000#000#000000 do_syncrep2: rid=001 (-1) Can't contact LDAP server slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com " ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying (3 retries left) connection_read(10): no connection! ... # hundreds of the above connection_read(10): no connection! slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com " ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying (2 retries left) connection_read(10): no connection! ... # hundreds of the above ... connection_read(10): no connection! slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying (2 retries left) slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying (1 retries left) slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 quitting slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) slapd: ../../../ldap-2.4-src/libraries/libldap_r/rq.c:92: ldap_pvt_runqueue_remove: Assertion `e == entry' failed.

Ing. Pierangelo Masarati OpenLDAP Core Team

Howard Chu

4:15 p.m.

Pierangelo Masarati wrote:

...

Pierangelo Masarati wrote:

...
I ran 30 times test045 with HEAD and got no failures. Then re24 failed after 44 runs with the backtrace below (identical to the previous ones).

I got thru 80 runs of test045 on HEAD (prior to my syncrepl patch) and then slapd hung on shutdown, with a deadlock between connections_shutdown(), connection_closing(), and send_ldap_ber(). So, still tinkering with this.

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Jens Vagelpohl

25 Jan 25 Jan

1:35 a.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On Jan 25, 2009, at 01:15 , Howard Chu wrote:

...

Pierangelo Masarati wrote:

...
Pierangelo Masarati wrote:

...
I ran 30 times test045 with HEAD and got no failures. Then re24 failed after 44 runs with the backtrace below (identical to the previous ones).

I got thru 80 runs of test045 on HEAD (prior to my syncrepl patch) and then slapd hung on shutdown, with a deadlock between connections_shutdown(), connection_closing(), and send_ldap_ber(). So, still tinkering with this.

test032-chain hangs for reliably me on shutdown on OS X 10.5.6/Intel.

jens

...PGP SIGNATURE...

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (Darwin) iEYEARECAAYFAkl8MmwACgkQRAx5nvEhZLITCwCffdqSQLh0FsvwGxC2KeBNcpSY ufUAn1eN/VWuM/Ee39TUmKxphb3OAY2W =MTyR -----END PGP SIGNATURE-----

Howard Chu

12:06 p.m.

Jens Vagelpohl wrote:

...

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On Jan 25, 2009, at 01:15 , Howard Chu wrote:

...
Pierangelo Masarati wrote:

...
Pierangelo Masarati wrote:

...
I ran 30 times test045 with HEAD and got no failures. Then re24 failed after 44 runs with the backtrace below (identical to the previous ones).

I got thru 80 runs of test045 on HEAD (prior to my syncrepl patch) and then slapd hung on shutdown, with a deadlock between connections_shutdown(), connection_closing(), and send_ldap_ber(). So, still tinkering with this.

test032-chain hangs for reliably me on shutdown on OS X 10.5.6/Intel.

Please provide a gdb snapshot of the hung process.

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Jens Vagelpohl

12:26 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On Jan 25, 2009, at 21:06 , Howard Chu wrote:

...

...
test032-chain hangs for reliably me on shutdown on OS X 10.5.6/Intel.

Please provide a gdb snapshot of the hung process.

Is there documentation that explains how to do that?

jens

...PGP SIGNATURE...

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (Darwin) iEYEARECAAYFAkl8yukACgkQRAx5nvEhZLJ3nwCffa1Ta7g8MZpD7Agg8fmwMZ5L XwsAnjQW8okUtYJ2ZQYaFwohC+u+h0oa =m6Zo -----END PGP SIGNATURE-----

Quanah Gibson-Mount

3:06 p.m.

--On Sunday, January 25, 2009 9:26 PM +0100 Jens Vagelpohl jens@dataflake.org wrote:

...

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On Jan 25, 2009, at 21:06 , Howard Chu wrote:

...
...
test032-chain hangs for reliably me on shutdown on OS X 10.5.6/Intel.

Please provide a gdb snapshot of the hung process.

Is there documentation that explains how to do that?

http://cs.baylor.edu/~donahoo/tools/gdb/tutorial.html

--Quanah

Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration

Pierangelo Masarati

1:23 p.m.

No more failures of this kind; however, now I intermittently get replication failures:

Using ldapmodify to modify master directory... Waiting 5 seconds for syncrepl to receive changes... Waiting 5 seconds for syncrepl to receive changes... Waiting 5 seconds for syncrepl to receive changes... 2 < Comparing retrieved entries from master and slave... test failed - master and slave databases differ bash-3.2$ diff -u testrun/server1.out.2 testrun/server2.out.2 --- testrun/server1.out.2 2009-01-25 21:54:57.000000000 +0100 +++ testrun/server2.out.2 2009-01-25 21:54:57.000000000 +0100 @@ -61,11 +61,13 @@ =com objectClass: OpenLDAPperson cn: Bjorn Jensen +cn: Biiff Jensen sn: Jensen uid: bjorn seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com userPassword:: Ympvcm4= homePostalAddress: 19923 Seven Mile Rd. $ South Lyon, MI 49999 +drink: Iced Tea description: Hiker, biker title: Director, Embedded Systems postalAddress: Info Tech Division $ 535 W. William St. $ Anytown, MI 48103 @@ -74,7 +76,6 @@ pager: +1 313 555 4474 facsimileTelephoneNumber: +1 313 555 2177 telephoneNumber: +1 313 555 0355 -drink: Iced Tea

dn: cn=Dorothy Stevens,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson @@ -108,23 +109,6 @@ telephoneNumber: +1 313 555 1817 associatedDomain: example.com

-dn: cn=Gern Jensen,ou=Information Technology Division,ou=People,dc=example,dc= - com -objectClass: OpenLDAPperson -cn: Gern Jensen -sn: Jensen -uid: gjensen -title: Chief Investigator, ITD -postalAddress: ITD $ 535 W. William St $ Ann Arbor, MI 48103 -seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com -drink: Coffee -homePostalAddress: 844 Brown St. Apt. 4 $ Ann Arbor, MI 48104 -description: Very odd -facsimileTelephoneNumber: +1 313 555 7557 -telephoneNumber: +1 313 555 8343 -mail: gjensen@mailgw.example.com -homePhone: +1 313 555 8844 - dn: ou=Groups,dc=example,dc=com objectClass: organizationalUnit ou: Groups @@ -278,18 +262,19 @@ cn: ITD Staff objectClass: groupOfUniqueNames uniqueMember: cn=Manager,dc=example,dc=com +uniqueMember: cn=Bjorn Jensen,ou=Information Technology Division,ou=People,dc= + example,dc=com +uniqueMember: cn=James A Jones 2,ou=Information Technology Division,ou=People, + dc=example,dc=com uniqueMember: cn=John Doe,ou=Information Technology Division,ou=People,dc=exam ple,dc=com -uniqueMember: cn=Dorothy Stevens,ou=Alumni Association,ou=People,dc=example,dc - =com -uniqueMember: cn=James A Jones 1,ou=Alumni Association,ou=People,dc=example,dc - =com

dn: cn=James A Jones 1,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson cn: James A Jones 1 cn: James Jones cn: Jim Jones +sn: Jones uid: jaj postalAddress: Alumni Association $ 111 Maple St $ Anytown, MI 48109 seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com @@ -302,8 +287,25 @@ mail: jaj@mail.alumni.example.com facsimileTelephoneNumber: +1 313 555 4332 telephoneNumber: +1 313 555 0895 -drink: Orange Juice -sn: Jones + +dn: cn=James A Jones 2,ou=Information Technology Division,ou=People,dc=example + ,dc=com +objectClass: OpenLDAPperson +cn: James A Jones 2 +cn: James Jones +cn: Jim Jones +sn: Doe +uid: jjones +seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com +homePostalAddress: 933 Brooks $ Anytown, MI 48104 +homePhone: +1 313 555 8838 +title: Senior Manager, Information Technology Division +description: Not around very much +mail: jjones@mailgw.example.com +postalAddress: Info Tech Division $ 535 W William $ Anytown, MI 48103 +pager: +1 313 555 2833 +facsimileTelephoneNumber: +1 313 555 8688 +telephoneNumber: +1 313 555 7334

dn: cn=Jane Doe,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson @@ -387,17 +389,6 @@ objectClass: organizationalUnit ou: People

-dn: ou=Retired,ou=People,dc=example,dc=com -objectClass: organizationalUnit -ou: Retired - -dn: cn=Rosco P. Coltrane,ou=Retired,ou=People,dc=example,dc=com -objectClass: OpenLDAPperson -sn: Coltrane -uid: rosco -description: Fat tycoon -cn: Rosco P. Coltrane - dn: dc=testdomain1,dc=example,dc=com objectClass: domain dc: testdomain1

Still investigating...

Ing. Pierangelo Masarati OpenLDAP Core Team

Pierangelo Masarati

2:48 p.m.

Pierangelo Masarati wrote:

...

No more failures of this kind; however, now I intermittently get replication failures:

The problem persists (only once in a while). It might still be connection-related, since the logs of server #3, the proxy that pushes replication to the consumer, are stuffed with tons of "connection_read(...): no connection!"

Ing. Pierangelo Masarati OpenLDAP Core Team

Howard Chu

26 Jan 26 Jan

1:58 p.m.

Pierangelo Masarati wrote:

...

Pierangelo Masarati wrote:

...
No more failures of this kind; however, now I intermittently get replication failures:

The problem persists (only once in a while). It might still be connection-related, since the logs of server #3, the proxy that pushes replication to the consumer, are stuffed with tons of "connection_read(...): no connection!"

What kind of system are you running on? Linux / multiprocessor?

One of the problems with epoll() on Linux is that it wakes up for HANGUP events all the time (they are not selectable in the input options; they're delivered regardless of whether you choose to wait for them or not). This also means we can't shut the notifications off when we acknowledge/act on them. So you'll get lots of repeated wakeups for the same hangup event. The new connection_hangup() function processes these inline for normal connections, but it still falls into the connection_read thread handling for client connections, so their normal cleanup handlers can be invoked. If your server is too busy, it will take a while for the submitted thread to execute, and then you'll get a lot of these spurious messages.

I've been experimenting with epoll's edge-triggered and oneshot modes, which would prevent multiple wakeups occurring for the same event. But unfortunately, when I set that it seems that the events can't be *re-enabled* when we want them, and so slapd hangs. Still looking at this.

But that's beside the point - you shouldn't be seeing any replication failures at all, regardless of connection close handling. What else are you seeing now?

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Aaron Richton

2:22 p.m.

Since test008 worked nicely overnight I decided to get a bit more aggressive. I don't think it's related to connection code, but with HEAD (large batch of ADDs; local smoke tests, not from ldap-src tests):

t@16 (l@16) terminated by signal SEGV (no mapping at the fault address) Current function is __ham_func4 (optimized) 177 HASH4; current thread: t@16 =>[1] __ham_func4(dbp = ???, key = ???, len = ???) (optimized), at 0xffffffff7af16818 (line ~177) in "hash_func.c" [2] __db_chksum(hdr = ???, data = ???, data_len = ???, mac_key = ???, store = ???) (optimized), at 0xffffffff7af16c8c (line ~112) in "hmac.c" [3] __txn_force_abort(env = ???, buffer = ???) (optimized), at 0xffffffff7af43ad4 (line ~1653) in "txn.c" [4] __log_flush_commit(env = ???, lsnp = ???, flags = ???) (optimized), at 0xffffffff7af1f514 (line ~524) in "log_put.c" [5] __log_put(env = ???, lsnp = ???, udbt = ???, flags = ???) (optimized), at 0xffffffff7af1f0f8 (line ~297) in "log_put.c" [6] __txn_regop_log(env = ???, txnp = ???, ret_lsnp = ???, flags = ???, opcode = ???, timestamp = ???, envid = ???, locks = ???) (optimized), at 0xffffffff7af448a8 (line ~237) in "txn_auto.c" [7] __txn_commit(txn = ???, flags = ???) (optimized), at 0xffffffff7af41c2c (line ~687) in "txn.c" [8] __txn_commit_pp(txn = ???, flags = ???) (optimized), at 0xffffffff7af41790 (line ~539) in "txn.c" [9] hdb_add(op = 0x10adb77b0, rs = 0xffffffff42fff998), line 456 in "add.c" [10] overlay_op_walk(op = 0x10adb77b0, rs = 0xffffffff42fff998, which = op_add, oi = 0x1005dfd70, on = (nil)), line 670 in "backover.c" [11] over_op_func(op = 0x10adb77b0, rs = 0xffffffff42fff998, which = op_add), line 722 in "backover.c" [12] over_op_add(op = 0x10adb77b0, rs = 0xffffffff42fff998), line 768 in "backover.c" [13] fe_op_add(op = 0x10adb77b0, rs = 0xffffffff42fff998), line 334 in "add.c" [14] do_add(op = 0x10adb77b0, rs = 0xffffffff42fff998), line 194 in "add.c" [15] connection_operation(ctx = 0xffffffff42fffc20, arg_v = 0x10adb77b0), line 1100 in "connection.c" [16] connection_read_thread(ctx = 0xffffffff42fffc20, argv = 0x39), line 1226 in "connection.c" [17] ldap_int_thread_pool_wrapper(xpool = 0x1005352c0), line 663 in "tpool.c"

db4 4.7.25 patched for Oracle #16406, #16415, and #16541. All other slapd threads are idle. FWIW, I'm not fully patched for #5860 (will re-checkout now, and kill the optimization on libdb4-7).

Aaron Richton

2:25 p.m.

On Mon, 26 Jan 2009, Aaron Richton wrote:

...

t@16 (l@16) terminated by signal SEGV (no mapping at the fault address)

Ehhhhhhhhh. I didn't (really) mean to send this. Although morally this shouldn't be a SEGV, this was the result of running out of disk space. I wasn't really going to press it.

Still, if there's an opportunity to more gracefully handle the negative return, that might be kinder...

Pierangelo Masarati

3:24 p.m.

Howard Chu wrote:

...

What kind of system are you running on? Linux / multiprocessor?

OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...

Ing. Pierangelo Masarati OpenLDAP Core Team

Pierangelo Masarati

3:26 p.m.

Pierangelo Masarati wrote:

...

Howard Chu wrote:

...
What kind of system are you running on? Linux / multiprocessor?

OK, it just happened again, after #52 runs of test052 with back-hdb.

I mean: test045.

...

Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...

p.

Ing. Pierangelo Masarati OpenLDAP Core Team

Quanah Gibson-Mount

5:57 p.m.

--On Tuesday, January 27, 2009 12:24 AM +0100 Pierangelo Masarati ando@sys-net.it wrote:

...

Howard Chu wrote:

...
What kind of system are you running on? Linux / multiprocessor?

OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...

Using my new looping run script... I've seen the connection_read(10) line once so far, but no failures. I'm running it 100 times.

--Quanah

Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration

Quanah Gibson-Mount

5:59 p.m.

--On Monday, January 26, 2009 5:57 PM -0800 Quanah Gibson-Mount quanah@zimbra.com wrote:

...

Using my new looping run script... I've seen the connection_read(10) line once so far, but no failures. I'm running it 100 times.

also, from slapd.3.log:

ldap_chkResponseList ld 0x19c484b0 msgid 2 all 0 ldap_chkResponseList returns ld 0x19c484b0 NULL ldap_int_select read1msg: ld 0x19c484b0 msgid 2 all 0 ber_get_next ldap_free_connection 1 0 connection_get(10) ldap_free_connection: actually freed ldap_err2string connection_get(10): got connid=0 do_syncrep2: rid=001 (-1) Can't contact LDAP server connection_get(10) connection_get(10): got connid=0 do_syncrepl: rid=001 retrying (4 retries left) connection_get(10) connection_read(10): no connection! =>do_syncrepl rid=001 ldap_create ldap_url_parse_ext(ldap://localhost:9011/) ldap_sasl_bind_s ldap_sasl_bind ldap_send_initial_request ldap_new_connection 1 1 0 ldap_int_open_connection ldap_connect_to_host: TCP localhost:9011 ldap_new_socket: 10 ldap_prepare_socket: 10 ldap_connect_to_host: Trying 127.0.0.1:9011 ldap_pvt_connect: fd: 10 tm: -1 async: 0 ldap_close_socket: 10 slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying (3 retries left) =>do_syncrepl rid=001 ldap_create ldap_url_parse_ext(ldap://localhost:9011/) ldap_sasl_bind_s ldap_sasl_bind ldap_send_initial_request ldap_new_connection 1 1 0 ldap_int_open_connection ldap_connect_to_host: TCP localhost:9011 ldap_new_socket: 10 ldap_prepare_socket: 10

Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration

Howard Chu

11:01 p.m.

Pierangelo Masarati wrote:

...

Howard Chu wrote:

...
What kind of system are you running on? Linux / multiprocessor?

OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...

With the current HEAD's edge-triggered epoll() mode, you should no longer be seeing multiple connection_read() events...

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Pierangelo Masarati

27 Jan 27 Jan

1:36 a.m.

Howard Chu wrote:

...

Pierangelo Masarati wrote:

...
Howard Chu wrote:

...
What kind of system are you running on? Linux / multiprocessor?

OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...

With the current HEAD's edge-triggered epoll() mode, you should no longer be seeing multiple connection_read() events...

Sorry for bailing out, it was 2:30am and I had to wake up early :) Now I just got it up and running past loop #70, and still running. I'll keep you informed.

Thanks, p.

Ing. Pierangelo Masarati OpenLDAP Core Team

Howard Chu

1:40 a.m.

Pierangelo Masarati wrote:

...

Howard Chu wrote:

...
Pierangelo Masarati wrote:

...
Howard Chu wrote:

...
What kind of system are you running on? Linux / multiprocessor?

OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...

With the current HEAD's edge-triggered epoll() mode, you should no longer be seeing multiple connection_read() events...

Sorry for bailing out, it was 2:30am and I had to wake up early :)

No problem, have plenty of things to keep us busy in the meantime ;)

...

Now I just got it up and running past loop #70, and still running. I'll keep you informed.

Sounds like an improvement, at least...

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Pierangelo Masarati

3:17 a.m.

Howard Chu wrote:

...

Pierangelo Masarati wrote:

...
Howard Chu wrote:

...
Pierangelo Masarati wrote:

...
Howard Chu wrote:

...
What kind of system are you running on? Linux / multiprocessor?

OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...

With the current HEAD's edge-triggered epoll() mode, you should no longer be seeing multiple connection_read() events...

Sorry for bailing out, it was 2:30am and I had to wake up early :)

No problem, have plenty of things to keep us busy in the meantime ;)

...
Now I just got it up and running past loop #70, and still running. I'll keep you informed.

Sounds like an improvement, at least...

got to > 200 executions without any problem. I'll let it run, just in case.

Ing. Pierangelo Masarati OpenLDAP Core Team

Michael Ströder

5:24 a.m.

Pierangelo Masarati wrote:

...

got to > 200 executions without any problem. I'll let it run, just in case.

I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.

Ciao, Michael.

Quanah Gibson-Mount

8:25 a.m.

--On Tuesday, January 27, 2009 2:24 PM +0100 Michael Ströder michael@stroeder.com wrote:

...

Pierangelo Masarati wrote:

...
got to > 200 executions without any problem. I'll let it run, just in case.

I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.

Ciao, Michael.

backtrace from gdb?

--Quanah

Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration

Pierangelo Masarati

8:30 a.m.

Michael Ströder wrote:

...

Pierangelo Masarati wrote:

...
got to > 200 executions without any problem. I'll let it run, just in case.

I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.

I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.

Ing. Pierangelo Masarati OpenLDAP Core Team

Quanah Gibson-Mount

8:34 a.m.

--On Tuesday, January 27, 2009 5:30 PM +0100 Pierangelo Masarati ando@sys-net.it wrote:

...

Michael Ströder wrote:

...
Pierangelo Masarati wrote:

...
got to > 200 executions without any problem. I'll let it run, just in case.

I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.

I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.

test019 hung on me for over half an hour, but it was because of a failed test clean up from last night before all the fixes were in.

--Quanah

Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration

Michael Ströder

9:38 a.m.

Pierangelo Masarati wrote:

...

Michael Ströder wrote:

...
Pierangelo Masarati wrote:

...
got to > 200 executions without any problem. I'll let it run, just in case.

I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.

I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.

The point is these hangs didn't occur before the reworked connection code. And I'm always using 'make distclean' in my build script. So there shouldn't be any leftovers from previous builds.

I will try to get a gdb backtrace tomorrow.

Ciao, Michael.

Howard Chu

9:50 a.m.

Michael Ströder wrote:

...

Pierangelo Masarati wrote:

...
Michael Ströder wrote:

...
Pierangelo Masarati wrote:

...
got to> 200 executions without any problem. I'll let it run, just in case.

I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.

I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.

The point is these hangs didn't occur before the reworked connection code. And I'm always using 'make distclean' in my build script. So there shouldn't be any leftovers from previous builds.

I will try to get a gdb backtrace tomorrow.

As you should know by now, reporting "I have a problem" without providing more details is pretty much useless. Meanwhile, based on others' detailed reports, the code will have changed again before tomorrow...

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Ralf Haferkamp

28 Jan 28 Jan

2:32 a.m.

Am Dienstag 27 Januar 2009 18:50:15 schrieb Howard Chu:

...

Michael Ströder wrote:

...
Pierangelo Masarati wrote:

...
Michael Ströder wrote:

...
Pierangelo Masarati wrote:

...
got to> 200 executions without any problem. I'll let it run, just in case.

I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.

I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.

The point is these hangs didn't occur before the reworked connection code. And I'm always using 'make distclean' in my build script. So there shouldn't be any leftovers from previous builds.

I will try to get a gdb backtrace tomorrow.

I was able to reproduce a hang in test039 after 15 successful iterations. It doesn't seem that slapd is completely stuck though. In slapd.3.log it repeatily logs:

ldap_result ld 0xa50b30 msgid 1 wait4msg ld 0xa50b30 msgid 1 (timeout 100000 usec) wait4msg continue ld 0xa50b30 msgid 1 all 1 ** ld 0xa50b30 Connections: * host: localhost port: 9012 (default) refcnt: 2 status: Connected last used: Wed Jan 28 01:58:26 2009

** ld 0xa50b30 Outstanding Requests: * msgid 1, origid 1, status InProgress outstanding referrals 0, parent count 0 ld 0xa50b30 request count 1 (abandoned 0) ** ld 0xa50b30 Response Queue: Empty ld 0xa50b30 response count 0 ldap_chkResponseList ld 0xa50b30 msgid 1 all 1 ldap_chkResponseList returns ld 0xa50b30 NULL ldap_int_select

Find attached a gdb backtrace, created from attaching to the process. Hope that helps.

-- regards, Ralf

Michael Ströder

5:16 a.m.

Howard Chu wrote:

...

Michael Ströder wrote:

...
Pierangelo Masarati wrote:

...
Michael Ströder wrote:

...
Pierangelo Masarati wrote:

...
got to> 200 executions without any problem. I'll let it run, just in case.

I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.

I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.

The point is these hangs didn't occur before the reworked connection code. And I'm always using 'make distclean' in my build script. So there shouldn't be any leftovers from previous builds.

I will try to get a gdb backtrace tomorrow.

As you should know by now, reporting "I have a problem" without providing more details is pretty much useless.

As you should know by now, sometimes priorities does not allow to put more effort into digging into it. But at least reporting that there is still a problem is more information than just staying quite.

Ciao, Michael.

Gavin Henry

24 Jan 24 Jan

6:03 a.m.

----- "Quanah Gibson-Mount" quanah@zimbra.com wrote:

...

RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a

number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it

All tests pass on Fedora 32bit, 64bit next.

-- Kind Regards, Gavin Henry. Managing Director. T +44 (0) 1224 279484 M +44 (0) 7930 323266 F +44 (0) 1224 824887 E ghenry@suretecsystems.com Open Source. Open Solutions(tm). http://www.suretecsystems.com/ Suretec Systems is a limited company registered in Scotland. Registered number: SC258005. Registered office: 13 Whiteley Well Place, Inverurie, Aberdeenshire, AB51 4FP. Subject to disclaimer at http://www.suretecgroup.com/disclaimer.html

Dieter Kluenter

9:23 a.m.

Quanah Gibson-Mount quanah@zimbra.com writes:

...

RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it does for both Howard and I, but Michael Ströder has reported failures in his OpenSuSE builds that we can't reproduce.

OpenSUSE-11.0, x86 BerkeleyDB-4.5

test036 fails, just doesn't finish. the process is waiting for something that will never occur: the last lines of test036: ... PID=32586 - Search(500): base="cn=Monitor" scope=sub filter="(objectClass=*)" attrs=cn (more...). PID=32402 - Search done (0). PID=32600 - Read(1000): entry="cn=Database 1,cn=Databases,cn=Monitor". PID=32300 - Bind done 1000 in 31.316030 seconds. PID=32614 - Bind(1000): dn="". PID=32332 - Bind done (0). PID=32328 - Read done (0). PID=32388 - Bind done (0). PID=32046 - Search done (0). PID=32374 - Read done (0). PID=32088 - Search done (0). PID=32434 - Bind done (0). PID=32420 - Read done (0). PID=32508 - Read done (0). PID=32480 - Bind done (0). PID=32554 - Read done (0). PID=32522 - Bind done 1000 in 21.653445 seconds. PID=32614 - Bind done (0). PID=32600 - Read done (0). PID=32138 - Search done (0).

last lines of slapd.1.log

conn=554 op=2324 SRCH base="cn=foo,ou=meta,dc=example,dc=com" scope=0 deref=0 filter="(objectClass=*)" conn=554 op=2324 SRCH attr=1.1 ==> limits_get: conn=554 op=2324 self="[anonymous]" this="cn=foo,ou=meta,dc=example,dc=com" => bdb_search bdb_dn2entry("cn=foo,ou=meta,dc=example,dc=com") => bdb_dn2id("ou=meta,dc=example,dc=com") <= bdb_dn2id: get failed: DB_NOTFOUND: No matching key/data pair found (-30989) send_ldap_result: conn=554 op=2324 p=3 send_ldap_result: err=10 matched="dc=example,dc=com" text="" send_ldap_response: msgid=2325 tag=101 err=32 ber_flush2: 32 bytes to sd 30 conn=554 op=2324 SEARCH RESULT tag=101 err=32 nentries=0 text=

last lines of slapd.2.log

conn=15 op=2324 SRCH base="cn=foo,ou=Meta,dc=example,dc=com" scope=0 deref=0 filter="(objectClass=*)" conn=15 op=2324 SRCH attr=1.1 ==> limits_get: conn=15 op=2324 self="[anonymous]" this="cn=foo,ou=meta,dc=example,dc=com" => bdb_search bdb_dn2entry("cn=foo,ou=meta,dc=example,dc=com") => bdb_dn2id("cn=foo,ou=meta,dc=example,dc=com") <= bdb_dn2id: get failed: DB_NOTFOUND: No matching key/data pair found (-30989) send_ldap_result: conn=15 op=2324 p=3 send_ldap_result: err=10 matched="ou=Meta,dc=example,dc=com" text="" send_ldap_response: msgid=2325 tag=101 err=32 ber_flush2: 40 bytes to sd 19 conn=15 op=2324 SEARCH RESULT tag=101 err=32 nentries=0 text= connection_close: conn=19 sd=13 conn=19 fd=13 closed (idletimeout) connection_close: conn=13 sd=17 conn=13 fd=17 closed (idletimeout) connection_close: conn=14 sd=18 conn=14 fd=18 closed (idletimeout) connection_close: conn=21 sd=20 conn=21 fd=20 closed (idletimeout) connection_close: conn=17 sd=21 conn=17 fd=21 closed (idletimeout) connection_close: conn=8 sd=22 conn=8 fd=22 closed (idletimeout) connection_close: conn=18 sd=23 conn=18 fd=23 closed (idletimeout) connection_close: conn=20 sd=24 conn=20 fd=24 closed (idletimeout) connection_close: conn=15 sd=19 conn=15 fd=19 closed (idletimeout) connection_close: conn=22 sd=25 conn=22 fd=25 closed (idletimeout)

last lines of slapd.3.log

<= send_search_entry: conn 70 exit. => send_search_entry: conn 70 dn="cn=Read,cn=Waiters,cn=Monitor" ber_flush2: 55 bytes to sd 17 <= send_search_entry: conn 70 exit. => send_search_entry: conn 70 dn="cn=Write,cn=Waiters,cn=Monitor" ber_flush2: 57 bytes to sd 17 <= send_search_entry: conn 70 exit. send_ldap_result: conn=70 op=472 p=3 send_ldap_result: err=0 matched="" text="" send_ldap_response: msgid=473 tag=101 err=0 ber_flush2: 15 bytes to sd 17 connection_get(17) conn=70 op=472 SEARCH RESULT tag=101 err=0 nentries=62 text= connection_get(17): got connid=70 connection_read(17): checking for input on id=70 ber_get_next ber_get_next: tag 0x30 len 56 contents: ber_get_next conn=70 op=473 do_search ber_scanf fmt ({miiiib) ber:

...

...
...
dnPrettyNormal: <cn=Monitor>

=> ldap_bv2dn(cn=Monitor,0)

-- Dieter Klünter | Systemberatung sip: +49.180.1555.7770535 http://www.dpunkt.de/buecher/2104.html GPG Key ID:8EF7B6C6

Michael Ströder

25 Jan 25 Jan

6:05 a.m.

Dieter Kluenter wrote:

...

test036 fails, just doesn't finish.

I can confirm this with HEAD recently synced from CVS on openSUSE 11.1.

Relevant packages used: kernel-default-base-2.6.27.7-9.1 gcc-4.3-34.168 libdb-4_5-4.5.20-94.17 (fully patched I think) cyrus-sasl-2.1.22-182.1 krb5-1.6.3-132.1

Ciao, Michael.

Howard Chu

12:05 p.m.

Michael Ströder wrote:

...

Dieter Kluenter wrote:

...
test036 fails, just doesn't finish.

I can confirm this with HEAD recently synced from CVS on openSUSE 11.1.

Relevant packages used: kernel-default-base-2.6.27.7-9.1 gcc-4.3-34.168 libdb-4_5-4.5.20-94.17 (fully patched I think) cyrus-sasl-2.1.22-182.1 krb5-1.6.3-132.1

Ciao, Michael.

Please provide a gdb snapshot of the hung process.

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Dieter Kluenter

24 Jan 24 Jan

9:28 a.m.

Quanah Gibson-Mount quanah@zimbra.com writes:

...

RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it does for both Howard and I, but Michael Ströder has reported failures in his OpenSuSE builds that we can't reproduce.

OpenSUSE-11.1, x86_64 BerkeleyDB-4.7 +patches

with db-4.7 i get following error:

/scripts/test055-valregex: line 121: 32502 Speicherzugriffsfehler $SLAPD -f $CONF1 -h $URI1 -d $LVL $TIMING > $LOG1 2>&1

...

...
...
...
...
./scripts/test055-valregex completed OK.

all tests complain about this memory allocation error, but the tests succeed

-Dieter

-- Dieter Klünter | Systemberatung http://www.dpunkt.de/buecher/2104.html sip: +49.180.1555.7770535 GPG Key ID:8EF7B6C6 53°08'09,95"N 10°08'02,42"E

6015

Age (days ago)

6020

Last active (days ago)

openldap-devel@openldap.org

37 comments

10 participants

tags (0)

participants (10)

Aaron Richton
Dieter Kluenter
Gavin Henry
Howard Chu
Jens Vagelpohl
Jonathan Clarke
Michael Ströder
Pierangelo Masarati
Quanah Gibson-Mount
Ralf Haferkamp