RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it does for both Howard and I, but Michael Ströder has reported failures in his OpenSuSE builds that we can't reproduce.
Thanks, Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it does for both Howard and I, but Michael Ströder has reported failures in his OpenSuSE builds that we can't reproduce.
All tests passed OK on RE24 from an hour ago here on Ubuntu 8.10 (i386).
Regards, Jonathan
Quanah Gibson-Mount wrote:
RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it does for both Howard and I, but Michael Ströder has reported failures in his OpenSuSE builds that we can't reproduce.
I found two apparently identical core dumps (from HEAD). The second one was from test045 with hdb (all tests went fine with bdb, apparently). The first one I don't know, as it occurred 12 minutes earlier, the test didn't detect it and I didn't save the log info from the runs. The backtrace looks identical:
test045:
#0 0x0049f402 in __kernel_vsyscall () (gdb) bt #0 0x0049f402 in __kernel_vsyscall () #1 0x00c9ad20 in raise () from /lib/libc.so.6 #2 0x00c9c631 in abort () from /lib/libc.so.6 #3 0x00c9416b in __assert_fail () from /lib/libc.so.6 #4 0x08219421 in ldap_pvt_runqueue_remove (rq=0x83521c0, entry=0x9635328) at ../../../ldap-2.4-src/libraries/libldap_r/rq.c:92 #5 0x080f567a in do_syncrepl (ctx=0xb63e11d0, arg=0x9635328) at ../../../ldap-2.4-src/servers/slapd/syncrepl.c:1402 #6 0x08085a71 in connection_read_thread (ctx=0xb63e11d0, argv=0xa) at ../../../ldap-2.4-src/servers/slapd/connection.c:1225 #7 0x082188a5 in ldap_int_thread_pool_wrapper (xpool=0x9604cd8) at ../../../ldap-2.4-src/libraries/libldap_r/tpool.c:663 #8 0x00deb46b in start_thread () from /lib/libpthread.so.0 #9 0x00d42dbe in clone () from /lib/libc.so.6
unknown test:
#0 0x006d3402 in __kernel_vsyscall () (gdb) bt #0 0x006d3402 in __kernel_vsyscall () #1 0x00c9ad20 in raise () from /lib/libc.so.6 #2 0x00c9c631 in abort () from /lib/libc.so.6 #3 0x00c9416b in __assert_fail () from /lib/libc.so.6 #4 0x08219421 in ldap_pvt_runqueue_remove (rq=0x83521c0, entry=0x963b050) at ../../../ldap-2.4-src/libraries/libldap_r/rq.c:92 #5 0x080f567a in do_syncrepl (ctx=0xb355d1d0, arg=0x963b050) at ../../../ldap-2.4-src/servers/slapd/syncrepl.c:1402 #6 0x08085a71 in connection_read_thread (ctx=0xb355d1d0, argv=0xc) at ../../../ldap-2.4-src/servers/slapd/connection.c:1225 #7 0x082188a5 in ldap_int_thread_pool_wrapper (xpool=0x960acd8) at ../../../ldap-2.4-src/libraries/libldap_r/tpool.c:663 #8 0x00deb46b in start_thread () from /lib/libpthread.so.0 #9 0x00d42dbe in clone () from /lib/libc.so.6
I'm currently using CentOS 5.2 kernel: 2.6.18-92.1.10.el5 Oracle DB: 4.7.25
The db library was built quite some time ago; I do not recall how many patches I applied, though I recall applying some. For this reason, I'm going to rebuild it right now.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
Pierangelo Masarati wrote:
Oracle DB: 4.7.25
The db library was built quite some time ago; I do not recall how many patches I applied, though I recall applying some. For this reason, I'm going to rebuild it right now.
Based on the fact that in the vicinity of my working directory I could only locate the first patch for 4.7.25, I infer I was missing the other two. I'm now rebuilding, will report ASAP.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
Pierangelo Masarati wrote:
Pierangelo Masarati wrote:
Oracle DB: 4.7.25
The db library was built quite some time ago; I do not recall how many patches I applied, though I recall applying some. For this reason, I'm going to rebuild it right now.
Based on the fact that in the vicinity of my working directory I could only locate the first patch for 4.7.25, I infer I was missing the other two. I'm now rebuilding, will report ASAP.
Well, I didn't expect too much improvement from rebuilding the Berkeley DB, as the issue seemed to be unrelated, but I might have been wrong, as no issue appeared in HEAD after re-running the test quite a few times. I'll keep trying, then move to re24 and report.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
Pierangelo Masarati wrote:
Pierangelo Masarati wrote:
Pierangelo Masarati wrote:
Oracle DB: 4.7.25
The db library was built quite some time ago; I do not recall how many patches I applied, though I recall applying some. For this reason, I'm going to rebuild it right now.
Based on the fact that in the vicinity of my working directory I could only locate the first patch for 4.7.25, I infer I was missing the other two. I'm now rebuilding, will report ASAP.
Well, I didn't expect too much improvement from rebuilding the Berkeley DB, as the issue seemed to be unrelated, but I might have been wrong, as no issue appeared in HEAD after re-running the test quite a few times. I'll keep trying, then move to re24 and report.
I ran 30 times test045 with HEAD and got no failures. Then re24 failed after 44 runs with the backtrace below (identical to the previous ones).
#0 0x0090f402 in __kernel_vsyscall () (gdb) bt #0 0x0090f402 in __kernel_vsyscall () #1 0x00c9ad20 in raise () from /lib/libc.so.6 #2 0x00c9c631 in abort () from /lib/libc.so.6 #3 0x00c9416b in __assert_fail () from /lib/libc.so.6 #4 0x08219421 in ldap_pvt_runqueue_remove (rq=0x83521c0, entry=0x9214328) at ../../../ldap-2.4-src/libraries/libldap_r/rq.c:92 #5 0x080f567a in do_syncrepl (ctx=0xb4bfc1d0, arg=0x9214328) at ../../../ldap-2.4-src/servers/slapd/syncrepl.c:1402 #6 0x08085a71 in connection_read_thread (ctx=0xb4bfc1d0, argv=0xa) at ../../../ldap-2.4-src/servers/slapd/connection.c:1225 #7 0x082188a5 in ldap_int_thread_pool_wrapper (xpool=0x91e3cd8) at ../../../ldap-2.4-src/libraries/libldap_r/tpool.c:663 #8 0x00deb46b in start_thread () from /lib/libpthread.so.0 #9 0x00d42dbe in clone () from /lib/libc.so.6
So I fear the problem is still there.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
Pierangelo Masarati wrote:
I ran 30 times test045 with HEAD and got no failures. Then re24 failed after 44 runs with the backtrace below (identical to the previous ones).
#0 0x0090f402 in __kernel_vsyscall () (gdb) bt #0 0x0090f402 in __kernel_vsyscall () #1 0x00c9ad20 in raise () from /lib/libc.so.6 #2 0x00c9c631 in abort () from /lib/libc.so.6 #3 0x00c9416b in __assert_fail () from /lib/libc.so.6 #4 0x08219421 in ldap_pvt_runqueue_remove (rq=0x83521c0, entry=0x9214328) at ../../../ldap-2.4-src/libraries/libldap_r/rq.c:92 #5 0x080f567a in do_syncrepl (ctx=0xb4bfc1d0, arg=0x9214328) at ../../../ldap-2.4-src/servers/slapd/syncrepl.c:1402 #6 0x08085a71 in connection_read_thread (ctx=0xb4bfc1d0, argv=0xa) at ../../../ldap-2.4-src/servers/slapd/connection.c:1225 #7 0x082188a5 in ldap_int_thread_pool_wrapper (xpool=0x91e3cd8) at ../../../ldap-2.4-src/libraries/libldap_r/tpool.c:663 #8 0x00deb46b in start_thread () from /lib/libpthread.so.0 #9 0x00d42dbe in clone () from /lib/libc.so.6
So I fear the problem is still there.
As far as I understand, that assertion seems to be too strict. Apparently, someone else already removed the task. Perhaps, all calls to runqueue should be preceded by a check about the existence of the task itself. An excerpt of that slapd instance's logs indicates:
slap_queue_csn: queing 0x92a30c8 20090124113556.050820Z#000000#000#000000 do_syncrep2: rid=001 (-1) Can't contact LDAP server slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com " ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying (3 retries left) connection_read(10): no connection! ... # hundreds of the above connection_read(10): no connection! slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com " ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying (2 retries left) connection_read(10): no connection! ... # hundreds of the above ... connection_read(10): no connection! slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying (2 retries left) slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying (1 retries left) slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 quitting slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) slapd: ../../../ldap-2.4-src/libraries/libldap_r/rq.c:92: ldap_pvt_runqueue_remove: Assertion `e == entry' failed.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
Pierangelo Masarati wrote:
Pierangelo Masarati wrote:
I ran 30 times test045 with HEAD and got no failures. Then re24 failed after 44 runs with the backtrace below (identical to the previous ones).
I got thru 80 runs of test045 on HEAD (prior to my syncrepl patch) and then slapd hung on shutdown, with a deadlock between connections_shutdown(), connection_closing(), and send_ldap_ber(). So, still tinkering with this.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On Jan 25, 2009, at 01:15 , Howard Chu wrote:
Pierangelo Masarati wrote:
Pierangelo Masarati wrote:
I ran 30 times test045 with HEAD and got no failures. Then re24 failed after 44 runs with the backtrace below (identical to the previous ones).
I got thru 80 runs of test045 on HEAD (prior to my syncrepl patch) and then slapd hung on shutdown, with a deadlock between connections_shutdown(), connection_closing(), and send_ldap_ber(). So, still tinkering with this.
test032-chain hangs for reliably me on shutdown on OS X 10.5.6/Intel.
jens
Jens Vagelpohl wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On Jan 25, 2009, at 01:15 , Howard Chu wrote:
Pierangelo Masarati wrote:
Pierangelo Masarati wrote:
I ran 30 times test045 with HEAD and got no failures. Then re24 failed after 44 runs with the backtrace below (identical to the previous ones).
I got thru 80 runs of test045 on HEAD (prior to my syncrepl patch) and then slapd hung on shutdown, with a deadlock between connections_shutdown(), connection_closing(), and send_ldap_ber(). So, still tinkering with this.
test032-chain hangs for reliably me on shutdown on OS X 10.5.6/Intel.
Please provide a gdb snapshot of the hung process.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On Jan 25, 2009, at 21:06 , Howard Chu wrote:
test032-chain hangs for reliably me on shutdown on OS X 10.5.6/Intel.
Please provide a gdb snapshot of the hung process.
Is there documentation that explains how to do that?
jens
--On Sunday, January 25, 2009 9:26 PM +0100 Jens Vagelpohl jens@dataflake.org wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On Jan 25, 2009, at 21:06 , Howard Chu wrote:
test032-chain hangs for reliably me on shutdown on OS X 10.5.6/Intel.
Please provide a gdb snapshot of the hung process.
Is there documentation that explains how to do that?
http://cs.baylor.edu/~donahoo/tools/gdb/tutorial.html
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
No more failures of this kind; however, now I intermittently get replication failures:
Using ldapmodify to modify master directory... Waiting 5 seconds for syncrepl to receive changes... Waiting 5 seconds for syncrepl to receive changes... Waiting 5 seconds for syncrepl to receive changes... 2 < Comparing retrieved entries from master and slave... test failed - master and slave databases differ bash-3.2$ diff -u testrun/server1.out.2 testrun/server2.out.2 --- testrun/server1.out.2 2009-01-25 21:54:57.000000000 +0100 +++ testrun/server2.out.2 2009-01-25 21:54:57.000000000 +0100 @@ -61,11 +61,13 @@ =com objectClass: OpenLDAPperson cn: Bjorn Jensen +cn: Biiff Jensen sn: Jensen uid: bjorn seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com userPassword:: Ympvcm4= homePostalAddress: 19923 Seven Mile Rd. $ South Lyon, MI 49999 +drink: Iced Tea description: Hiker, biker title: Director, Embedded Systems postalAddress: Info Tech Division $ 535 W. William St. $ Anytown, MI 48103 @@ -74,7 +76,6 @@ pager: +1 313 555 4474 facsimileTelephoneNumber: +1 313 555 2177 telephoneNumber: +1 313 555 0355 -drink: Iced Tea
dn: cn=Dorothy Stevens,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson @@ -108,23 +109,6 @@ telephoneNumber: +1 313 555 1817 associatedDomain: example.com
-dn: cn=Gern Jensen,ou=Information Technology Division,ou=People,dc=example,dc= - com -objectClass: OpenLDAPperson -cn: Gern Jensen -sn: Jensen -uid: gjensen -title: Chief Investigator, ITD -postalAddress: ITD $ 535 W. William St $ Ann Arbor, MI 48103 -seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com -drink: Coffee -homePostalAddress: 844 Brown St. Apt. 4 $ Ann Arbor, MI 48104 -description: Very odd -facsimileTelephoneNumber: +1 313 555 7557 -telephoneNumber: +1 313 555 8343 -mail: gjensen@mailgw.example.com -homePhone: +1 313 555 8844 - dn: ou=Groups,dc=example,dc=com objectClass: organizationalUnit ou: Groups @@ -278,18 +262,19 @@ cn: ITD Staff objectClass: groupOfUniqueNames uniqueMember: cn=Manager,dc=example,dc=com +uniqueMember: cn=Bjorn Jensen,ou=Information Technology Division,ou=People,dc= + example,dc=com +uniqueMember: cn=James A Jones 2,ou=Information Technology Division,ou=People, + dc=example,dc=com uniqueMember: cn=John Doe,ou=Information Technology Division,ou=People,dc=exam ple,dc=com -uniqueMember: cn=Dorothy Stevens,ou=Alumni Association,ou=People,dc=example,dc - =com -uniqueMember: cn=James A Jones 1,ou=Alumni Association,ou=People,dc=example,dc - =com
dn: cn=James A Jones 1,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson cn: James A Jones 1 cn: James Jones cn: Jim Jones +sn: Jones uid: jaj postalAddress: Alumni Association $ 111 Maple St $ Anytown, MI 48109 seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com @@ -302,8 +287,25 @@ mail: jaj@mail.alumni.example.com facsimileTelephoneNumber: +1 313 555 4332 telephoneNumber: +1 313 555 0895 -drink: Orange Juice -sn: Jones + +dn: cn=James A Jones 2,ou=Information Technology Division,ou=People,dc=example + ,dc=com +objectClass: OpenLDAPperson +cn: James A Jones 2 +cn: James Jones +cn: Jim Jones +sn: Doe +uid: jjones +seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com +homePostalAddress: 933 Brooks $ Anytown, MI 48104 +homePhone: +1 313 555 8838 +title: Senior Manager, Information Technology Division +description: Not around very much +mail: jjones@mailgw.example.com +postalAddress: Info Tech Division $ 535 W William $ Anytown, MI 48103 +pager: +1 313 555 2833 +facsimileTelephoneNumber: +1 313 555 8688 +telephoneNumber: +1 313 555 7334
dn: cn=Jane Doe,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson @@ -387,17 +389,6 @@ objectClass: organizationalUnit ou: People
-dn: ou=Retired,ou=People,dc=example,dc=com -objectClass: organizationalUnit -ou: Retired - -dn: cn=Rosco P. Coltrane,ou=Retired,ou=People,dc=example,dc=com -objectClass: OpenLDAPperson -sn: Coltrane -uid: rosco -description: Fat tycoon -cn: Rosco P. Coltrane - dn: dc=testdomain1,dc=example,dc=com objectClass: domain dc: testdomain1
Still investigating...
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
Pierangelo Masarati wrote:
No more failures of this kind; however, now I intermittently get replication failures:
The problem persists (only once in a while). It might still be connection-related, since the logs of server #3, the proxy that pushes replication to the consumer, are stuffed with tons of "connection_read(...): no connection!"
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
Pierangelo Masarati wrote:
Pierangelo Masarati wrote:
No more failures of this kind; however, now I intermittently get replication failures:
The problem persists (only once in a while). It might still be connection-related, since the logs of server #3, the proxy that pushes replication to the consumer, are stuffed with tons of "connection_read(...): no connection!"
What kind of system are you running on? Linux / multiprocessor?
One of the problems with epoll() on Linux is that it wakes up for HANGUP events all the time (they are not selectable in the input options; they're delivered regardless of whether you choose to wait for them or not). This also means we can't shut the notifications off when we acknowledge/act on them. So you'll get lots of repeated wakeups for the same hangup event. The new connection_hangup() function processes these inline for normal connections, but it still falls into the connection_read thread handling for client connections, so their normal cleanup handlers can be invoked. If your server is too busy, it will take a while for the submitted thread to execute, and then you'll get a lot of these spurious messages.
I've been experimenting with epoll's edge-triggered and oneshot modes, which would prevent multiple wakeups occurring for the same event. But unfortunately, when I set that it seems that the events can't be *re-enabled* when we want them, and so slapd hangs. Still looking at this.
But that's beside the point - you shouldn't be seeing any replication failures at all, regardless of connection close handling. What else are you seeing now?
Since test008 worked nicely overnight I decided to get a bit more aggressive. I don't think it's related to connection code, but with HEAD (large batch of ADDs; local smoke tests, not from ldap-src tests):
t@16 (l@16) terminated by signal SEGV (no mapping at the fault address) Current function is __ham_func4 (optimized) 177 HASH4; current thread: t@16 =>[1] __ham_func4(dbp = ???, key = ???, len = ???) (optimized), at 0xffffffff7af16818 (line ~177) in "hash_func.c" [2] __db_chksum(hdr = ???, data = ???, data_len = ???, mac_key = ???, store = ???) (optimized), at 0xffffffff7af16c8c (line ~112) in "hmac.c" [3] __txn_force_abort(env = ???, buffer = ???) (optimized), at 0xffffffff7af43ad4 (line ~1653) in "txn.c" [4] __log_flush_commit(env = ???, lsnp = ???, flags = ???) (optimized), at 0xffffffff7af1f514 (line ~524) in "log_put.c" [5] __log_put(env = ???, lsnp = ???, udbt = ???, flags = ???) (optimized), at 0xffffffff7af1f0f8 (line ~297) in "log_put.c" [6] __txn_regop_log(env = ???, txnp = ???, ret_lsnp = ???, flags = ???, opcode = ???, timestamp = ???, envid = ???, locks = ???) (optimized), at 0xffffffff7af448a8 (line ~237) in "txn_auto.c" [7] __txn_commit(txn = ???, flags = ???) (optimized), at 0xffffffff7af41c2c (line ~687) in "txn.c" [8] __txn_commit_pp(txn = ???, flags = ???) (optimized), at 0xffffffff7af41790 (line ~539) in "txn.c" [9] hdb_add(op = 0x10adb77b0, rs = 0xffffffff42fff998), line 456 in "add.c" [10] overlay_op_walk(op = 0x10adb77b0, rs = 0xffffffff42fff998, which = op_add, oi = 0x1005dfd70, on = (nil)), line 670 in "backover.c" [11] over_op_func(op = 0x10adb77b0, rs = 0xffffffff42fff998, which = op_add), line 722 in "backover.c" [12] over_op_add(op = 0x10adb77b0, rs = 0xffffffff42fff998), line 768 in "backover.c" [13] fe_op_add(op = 0x10adb77b0, rs = 0xffffffff42fff998), line 334 in "add.c" [14] do_add(op = 0x10adb77b0, rs = 0xffffffff42fff998), line 194 in "add.c" [15] connection_operation(ctx = 0xffffffff42fffc20, arg_v = 0x10adb77b0), line 1100 in "connection.c" [16] connection_read_thread(ctx = 0xffffffff42fffc20, argv = 0x39), line 1226 in "connection.c" [17] ldap_int_thread_pool_wrapper(xpool = 0x1005352c0), line 663 in "tpool.c"
db4 4.7.25 patched for Oracle #16406, #16415, and #16541. All other slapd threads are idle. FWIW, I'm not fully patched for #5860 (will re-checkout now, and kill the optimization on libdb4-7).
On Mon, 26 Jan 2009, Aaron Richton wrote:
t@16 (l@16) terminated by signal SEGV (no mapping at the fault address)
Ehhhhhhhhh. I didn't (really) mean to send this. Although morally this shouldn't be a SEGV, this was the result of running out of disk space. I wasn't really going to press it.
Still, if there's an opportunity to more gracefully handle the negative return, that might be kinder...
Howard Chu wrote:
What kind of system are you running on? Linux / multiprocessor?
OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
Pierangelo Masarati wrote:
Howard Chu wrote:
What kind of system are you running on? Linux / multiprocessor?
OK, it just happened again, after #52 runs of test052 with back-hdb.
I mean: test045.
Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
--On Tuesday, January 27, 2009 12:24 AM +0100 Pierangelo Masarati ando@sys-net.it wrote:
Howard Chu wrote:
What kind of system are you running on? Linux / multiprocessor?
OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...
Using my new looping run script... I've seen the connection_read(10) line once so far, but no failures. I'm running it 100 times.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
--On Monday, January 26, 2009 5:57 PM -0800 Quanah Gibson-Mount quanah@zimbra.com wrote:
Using my new looping run script... I've seen the connection_read(10) line once so far, but no failures. I'm running it 100 times.
also, from slapd.3.log:
ldap_chkResponseList ld 0x19c484b0 msgid 2 all 0 ldap_chkResponseList returns ld 0x19c484b0 NULL ldap_int_select read1msg: ld 0x19c484b0 msgid 2 all 0 ber_get_next ldap_free_connection 1 0 connection_get(10) ldap_free_connection: actually freed ldap_err2string connection_get(10): got connid=0 do_syncrep2: rid=001 (-1) Can't contact LDAP server connection_get(10) connection_get(10): got connid=0 do_syncrepl: rid=001 retrying (4 retries left) connection_get(10) connection_read(10): no connection! =>do_syncrepl rid=001 ldap_create ldap_url_parse_ext(ldap://localhost:9011/) ldap_sasl_bind_s ldap_sasl_bind ldap_send_initial_request ldap_new_connection 1 1 0 ldap_int_open_connection ldap_connect_to_host: TCP localhost:9011 ldap_new_socket: 10 ldap_prepare_socket: 10 ldap_connect_to_host: Trying 127.0.0.1:9011 ldap_pvt_connect: fd: 10 tm: -1 async: 0 ldap_close_socket: 10 slap_client_connect: URI=ldap://localhost:9011/ DN="cn=manager,dc=example,dc=com" ldap_sasl_bind_s failed (-1) do_syncrepl: rid=001 retrying (3 retries left) =>do_syncrepl rid=001 ldap_create ldap_url_parse_ext(ldap://localhost:9011/) ldap_sasl_bind_s ldap_sasl_bind ldap_send_initial_request ldap_new_connection 1 1 0 ldap_int_open_connection ldap_connect_to_host: TCP localhost:9011 ldap_new_socket: 10 ldap_prepare_socket: 10
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Pierangelo Masarati wrote:
Howard Chu wrote:
What kind of system are you running on? Linux / multiprocessor?
OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...
With the current HEAD's edge-triggered epoll() mode, you should no longer be seeing multiple connection_read() events...
Howard Chu wrote:
Pierangelo Masarati wrote:
Howard Chu wrote:
What kind of system are you running on? Linux / multiprocessor?
OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...
With the current HEAD's edge-triggered epoll() mode, you should no longer be seeing multiple connection_read() events...
Sorry for bailing out, it was 2:30am and I had to wake up early :) Now I just got it up and running past loop #70, and still running. I'll keep you informed.
Thanks, p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
Pierangelo Masarati wrote:
Howard Chu wrote:
Pierangelo Masarati wrote:
Howard Chu wrote:
What kind of system are you running on? Linux / multiprocessor?
OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...
With the current HEAD's edge-triggered epoll() mode, you should no longer be seeing multiple connection_read() events...
Sorry for bailing out, it was 2:30am and I had to wake up early :)
No problem, have plenty of things to keep us busy in the meantime ;)
Now I just got it up and running past loop #70, and still running. I'll keep you informed.
Sounds like an improvement, at least...
Howard Chu wrote:
Pierangelo Masarati wrote:
Howard Chu wrote:
Pierangelo Masarati wrote:
Howard Chu wrote:
What kind of system are you running on? Linux / multiprocessor?
OK, it just happened again, after #52 runs of test052 with back-hdb. Anything I can show you? The error in the replication is the same as in my previous message, and the logs from the replication proxy contains "connection_read(10): no connection!" 343303 (!?!) times...
With the current HEAD's edge-triggered epoll() mode, you should no longer be seeing multiple connection_read() events...
Sorry for bailing out, it was 2:30am and I had to wake up early :)
No problem, have plenty of things to keep us busy in the meantime ;)
Now I just got it up and running past loop #70, and still running. I'll keep you informed.
Sounds like an improvement, at least...
got to > 200 executions without any problem. I'll let it run, just in case.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
--On Tuesday, January 27, 2009 2:24 PM +0100 Michael Ströder michael@stroeder.com wrote:
Pierangelo Masarati wrote:
got to > 200 executions without any problem. I'll let it run, just in case.
I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.
Ciao, Michael.
backtrace from gdb?
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Michael Ströder wrote:
Pierangelo Masarati wrote:
got to > 200 executions without any problem. I'll let it run, just in case.
I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.
I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
--On Tuesday, January 27, 2009 5:30 PM +0100 Pierangelo Masarati ando@sys-net.it wrote:
Michael Ströder wrote:
Pierangelo Masarati wrote:
got to > 200 executions without any problem. I'll let it run, just in case.
I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.
I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.
test019 hung on me for over half an hour, but it was because of a failed test clean up from last night before all the fixes were in.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Pierangelo Masarati wrote:
Michael Ströder wrote:
Pierangelo Masarati wrote:
got to > 200 executions without any problem. I'll let it run, just in case.
I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.
I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.
The point is these hangs didn't occur before the reworked connection code. And I'm always using 'make distclean' in my build script. So there shouldn't be any leftovers from previous builds.
I will try to get a gdb backtrace tomorrow.
Ciao, Michael.
Michael Ströder wrote:
Pierangelo Masarati wrote:
Michael Ströder wrote:
Pierangelo Masarati wrote:
got to> 200 executions without any problem. I'll let it run, just in case.
I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.
I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.
The point is these hangs didn't occur before the reworked connection code. And I'm always using 'make distclean' in my build script. So there shouldn't be any leftovers from previous builds.
I will try to get a gdb backtrace tomorrow.
As you should know by now, reporting "I have a problem" without providing more details is pretty much useless. Meanwhile, based on others' detailed reports, the code will have changed again before tomorrow...
Am Dienstag 27 Januar 2009 18:50:15 schrieb Howard Chu:
Michael Ströder wrote:
Pierangelo Masarati wrote:
Michael Ströder wrote:
Pierangelo Masarati wrote:
got to> 200 executions without any problem. I'll let it run, just in case.
I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.
I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.
The point is these hangs didn't occur before the reworked connection code. And I'm always using 'make distclean' in my build script. So there shouldn't be any leftovers from previous builds.
I will try to get a gdb backtrace tomorrow.
I was able to reproduce a hang in test039 after 15 successful iterations. It doesn't seem that slapd is completely stuck though. In slapd.3.log it repeatily logs:
ldap_result ld 0xa50b30 msgid 1 wait4msg ld 0xa50b30 msgid 1 (timeout 100000 usec) wait4msg continue ld 0xa50b30 msgid 1 all 1 ** ld 0xa50b30 Connections: * host: localhost port: 9012 (default) refcnt: 2 status: Connected last used: Wed Jan 28 01:58:26 2009
** ld 0xa50b30 Outstanding Requests: * msgid 1, origid 1, status InProgress outstanding referrals 0, parent count 0 ld 0xa50b30 request count 1 (abandoned 0) ** ld 0xa50b30 Response Queue: Empty ld 0xa50b30 response count 0 ldap_chkResponseList ld 0xa50b30 msgid 1 all 1 ldap_chkResponseList returns ld 0xa50b30 NULL ldap_int_select
Find attached a gdb backtrace, created from attaching to the process. Hope that helps.
Howard Chu wrote:
Michael Ströder wrote:
Pierangelo Masarati wrote:
Michael Ströder wrote:
Pierangelo Masarati wrote:
got to> 200 executions without any problem. I'll let it run, just in case.
I still get occasional hangs with RE24 in test008, test036 or test039 on openSUSE 11.1.
I didn't get any after running each of those tests at least 50 times each. I suspect something related to your distro.
The point is these hangs didn't occur before the reworked connection code. And I'm always using 'make distclean' in my build script. So there shouldn't be any leftovers from previous builds.
I will try to get a gdb backtrace tomorrow.
As you should know by now, reporting "I have a problem" without providing more details is pretty much useless.
As you should know by now, sometimes priorities does not allow to put more effort into digging into it. But at least reporting that there is still a problem is more information than just staying quite.
Ciao, Michael.
----- "Quanah Gibson-Mount" quanah@zimbra.com wrote:
RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a
number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it
All tests pass on Fedora 32bit, 64bit next.
Quanah Gibson-Mount quanah@zimbra.com writes:
RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it does for both Howard and I, but Michael Ströder has reported failures in his OpenSuSE builds that we can't reproduce.
OpenSUSE-11.0, x86 BerkeleyDB-4.5
test036 fails, just doesn't finish. the process is waiting for something that will never occur: the last lines of test036: ... PID=32586 - Search(500): base="cn=Monitor" scope=sub filter="(objectClass=*)" attrs=cn (more...). PID=32402 - Search done (0). PID=32600 - Read(1000): entry="cn=Database 1,cn=Databases,cn=Monitor". PID=32300 - Bind done 1000 in 31.316030 seconds. PID=32614 - Bind(1000): dn="". PID=32332 - Bind done (0). PID=32328 - Read done (0). PID=32388 - Bind done (0). PID=32046 - Search done (0). PID=32374 - Read done (0). PID=32088 - Search done (0). PID=32434 - Bind done (0). PID=32420 - Read done (0). PID=32508 - Read done (0). PID=32480 - Bind done (0). PID=32554 - Read done (0). PID=32522 - Bind done 1000 in 21.653445 seconds. PID=32614 - Bind done (0). PID=32600 - Read done (0). PID=32138 - Search done (0).
last lines of slapd.1.log
conn=554 op=2324 SRCH base="cn=foo,ou=meta,dc=example,dc=com" scope=0 deref=0 filter="(objectClass=*)" conn=554 op=2324 SRCH attr=1.1 ==> limits_get: conn=554 op=2324 self="[anonymous]" this="cn=foo,ou=meta,dc=example,dc=com" => bdb_search bdb_dn2entry("cn=foo,ou=meta,dc=example,dc=com") => bdb_dn2id("ou=meta,dc=example,dc=com") <= bdb_dn2id: get failed: DB_NOTFOUND: No matching key/data pair found (-30989) send_ldap_result: conn=554 op=2324 p=3 send_ldap_result: err=10 matched="dc=example,dc=com" text="" send_ldap_response: msgid=2325 tag=101 err=32 ber_flush2: 32 bytes to sd 30 conn=554 op=2324 SEARCH RESULT tag=101 err=32 nentries=0 text=
last lines of slapd.2.log
conn=15 op=2324 SRCH base="cn=foo,ou=Meta,dc=example,dc=com" scope=0 deref=0 filter="(objectClass=*)" conn=15 op=2324 SRCH attr=1.1 ==> limits_get: conn=15 op=2324 self="[anonymous]" this="cn=foo,ou=meta,dc=example,dc=com" => bdb_search bdb_dn2entry("cn=foo,ou=meta,dc=example,dc=com") => bdb_dn2id("cn=foo,ou=meta,dc=example,dc=com") <= bdb_dn2id: get failed: DB_NOTFOUND: No matching key/data pair found (-30989) send_ldap_result: conn=15 op=2324 p=3 send_ldap_result: err=10 matched="ou=Meta,dc=example,dc=com" text="" send_ldap_response: msgid=2325 tag=101 err=32 ber_flush2: 40 bytes to sd 19 conn=15 op=2324 SEARCH RESULT tag=101 err=32 nentries=0 text= connection_close: conn=19 sd=13 conn=19 fd=13 closed (idletimeout) connection_close: conn=13 sd=17 conn=13 fd=17 closed (idletimeout) connection_close: conn=14 sd=18 conn=14 fd=18 closed (idletimeout) connection_close: conn=21 sd=20 conn=21 fd=20 closed (idletimeout) connection_close: conn=17 sd=21 conn=17 fd=21 closed (idletimeout) connection_close: conn=8 sd=22 conn=8 fd=22 closed (idletimeout) connection_close: conn=18 sd=23 conn=18 fd=23 closed (idletimeout) connection_close: conn=20 sd=24 conn=20 fd=24 closed (idletimeout) connection_close: conn=15 sd=19 conn=15 fd=19 closed (idletimeout) connection_close: conn=22 sd=25 conn=22 fd=25 closed (idletimeout)
last lines of slapd.3.log
<= send_search_entry: conn 70 exit. => send_search_entry: conn 70 dn="cn=Read,cn=Waiters,cn=Monitor" ber_flush2: 55 bytes to sd 17 <= send_search_entry: conn 70 exit. => send_search_entry: conn 70 dn="cn=Write,cn=Waiters,cn=Monitor" ber_flush2: 57 bytes to sd 17 <= send_search_entry: conn 70 exit. send_ldap_result: conn=70 op=472 p=3 send_ldap_result: err=0 matched="" text="" send_ldap_response: msgid=473 tag=101 err=0 ber_flush2: 15 bytes to sd 17 connection_get(17) conn=70 op=472 SEARCH RESULT tag=101 err=0 nentries=62 text= connection_get(17): got connid=70 connection_read(17): checking for input on id=70 ber_get_next ber_get_next: tag 0x30 len 56 contents: ber_get_next conn=70 op=473 do_search ber_scanf fmt ({miiiib) ber:
dnPrettyNormal: <cn=Monitor>
=> ldap_bv2dn(cn=Monitor,0)
Dieter Kluenter wrote:
test036 fails, just doesn't finish.
I can confirm this with HEAD recently synced from CVS on openSUSE 11.1.
Relevant packages used: kernel-default-base-2.6.27.7-9.1 gcc-4.3-34.168 libdb-4_5-4.5.20-94.17 (fully patched I think) cyrus-sasl-2.1.22-182.1 krb5-1.6.3-132.1
Ciao, Michael.
Michael Ströder wrote:
Dieter Kluenter wrote:
test036 fails, just doesn't finish.
I can confirm this with HEAD recently synced from CVS on openSUSE 11.1.
Relevant packages used: kernel-default-base-2.6.27.7-9.1 gcc-4.3-34.168 libdb-4_5-4.5.20-94.17 (fully patched I think) cyrus-sasl-2.1.22-182.1 krb5-1.6.3-132.1
Ciao, Michael.
Please provide a gdb snapshot of the hung process.
Quanah Gibson-Mount quanah@zimbra.com writes:
RE24 has had some code reworking in the connections area, due to some race conditions that were being triggered. It would be useful if people could test current RE24 to see if they encounter issues. There are still a number of outstanding issues that need to be resolved before 2.4.14 will be released, but I'd appreciate a head start on making sure that at least make test is passing for folks with this new code in place. So far, it does for both Howard and I, but Michael Ströder has reported failures in his OpenSuSE builds that we can't reproduce.
OpenSUSE-11.1, x86_64 BerkeleyDB-4.7 +patches
with db-4.7 i get following error:
/scripts/test055-valregex: line 121: 32502 Speicherzugriffsfehler $SLAPD -f $CONF1 -h $URI1 -d $LVL $TIMING > $LOG1 2>&1
./scripts/test055-valregex completed OK.
all tests complain about this memory allocation error, but the tests succeed
-Dieter