<quote who="Quanah Gibson-Mount">
I've sync'd it up with everything I'm aware of as being pending in preparation for getting 2.4.7 out this week. Please test.
Passes ok on F8 i386.
Thanks!
--Quanah
Succeeded on Fedora 8 x86_64; I will run Solaris sparcv7/sparcv9 overnight.
I had one core dump overnight, but I couldn't reproduce it under a memory debugger, so I just decided to run the whole suite again.
sparcv9 test050 is dead in the water; I copied off the testrun directory and snapped core files just in case.
I'll make a fresh checkout and try again...
On Tue, 27 Nov 2007, Aaron Richton wrote:
Succeeded on Fedora 8 x86_64; I will run Solaris sparcv7/sparcv9 overnight.
Aaron Richton wrote:
I had one core dump overnight, but I couldn't reproduce it under a memory debugger, so I just decided to run the whole suite again.
sparcv9 test050 is dead in the water; I copied off the testrun directory and snapped core files just in case.
Strange, test050 passed on my Solaris 10 sparcv9 system.
I'll make a fresh checkout and try again...
On Tue, 27 Nov 2007, Aaron Richton wrote:
Succeeded on Fedora 8 x86_64; I will run Solaris sparcv7/sparcv9 overnight.
I'm definitely playing whack-a-mole with varying, intermittent failures here. For example:
https://www.nbcs.rutgers.edu/~richton/richton-20071128-test039fail.tgz [~310MB uncompressed]
[...output was...]
Using ldapsearch to retrieve all the entries...
Filtering ldapsearch results...
Filtering original ldif used to create database...
Comparing filter output...
comparison failed - slapd-ldap search/modification didn't succeed
./scripts/test039-glue-ldap-concurrency failed (exit 1)
*** Error code 1
make: Fatal error: Command failed for target `bdb-yes'
Current working directory /free/BUILD/openldap-20071127/tests
*** Error code 1
make: Fatal error: Command failed for target `test'
Current working directory /free/BUILD/openldap-20071127/tests
gmake: *** [test] Error 1
And I know that 039 *can* succeed, given my earlier 050 failure (the suite must have gotten past 039 at that point)...
note, I did a cvs diff, don't see any RE24 changes since last night's build.
On Wed, 28 Nov 2007, Howard Chu wrote:
Aaron Richton wrote:
I had one core dump overnight, but I couldn't reproduce it under a memory debugger, so I just decided to run the whole suite again.
sparcv9 test050 is dead in the water; I copied off the testrun directory and snapped core files just in case.
Strange, test050 passed on my Solaris10 sparcv9 system.
I'll make a fresh checkout and try again...
On Tue, 27 Nov 2007, Aaron Richton wrote:
Succeeded on Fedora 8 x86_64; I will run Solaris sparcv7/sparcv9 overnight.
--
  -- Howard Chu
  Chief Architect, Symas Corp. http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP http://www.openldap.org/project/
--On November 28, 2007 9:32:47 AM -0500 Aaron Richton richton@nbcs.rutgers.edu wrote:
note, I did a cvs diff, don't see any RE24 changes since last night's build.
Not quite sure what you mean by last night's build, but there were a lot of changes to RE24 yesterday, up until around 12 PM Pacific time.
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
Aaron Richton wrote:
I'm definitely playing whack-a-mole with varying, intermittent failures here. For example:
https://www.nbcs.rutgers.edu/~richton/richton-20071128-test039fail.tgz [~310MB uncompressed]
Looking...
Aside from some problems with /usr/bin/awk in test020, all tests passed for me on Solaris 10 sparcv9. I'll try to compare my test039 logs with your tarball; seems like you had a failed Delete operation somewhere.
test039 -b bdb, under watchmalloc so this in theory is The Actual Instruction:
(dbx) threads
   t@1 a l@1 ?() LWP suspended in __lwp_wait()
   t@2 a l@2 slapd_daemon_task() LWP suspended in _libc_poll()
   t@3 a l@3 ldap_int_thread_pool_wrapper() LWP suspended in attr_index_name_cmp()
o> t@4 a l@4 ldap_int_thread_pool_wrapper() signal SIGSEGV in _ti_pthread_mutex_unlock()
   t@5 a l@5 ldap_int_thread_pool_wrapper() sleep on 0xffffffff7f304650 in __lwp_park()
   t@6 a l@6 ldap_int_thread_pool_wrapper() sleep on 0xffffffff7f304650 in __lwp_park()
   t@7 a l@7 ldap_int_thread_pool_wrapper() sleep on 0xffffffff7f304650 in __lwp_park()
   t@8 a l@8 ldap_int_thread_pool_wrapper() sleep on 0xffffffff7f304650 in __lwp_park()
   t@9 a l@9 ldap_int_thread_pool_wrapper() LWP suspended in match_re_C()
   t@10 a l@10 ldap_int_thread_pool_wrapper() sleep on 0xffffffff7f304650 in __lwp_park()
   t@11 a l@11 ldap_int_thread_pool_wrapper() sleep on 0x10056ac10 in __lwp_park()
   t@12 a l@12 ldap_int_thread_pool_wrapper() sleep on 0xffffffff7f304650 in __lwp_park()
   t@13 a l@13 ldap_int_thread_pool_wrapper() sleep on 0xffffffff7f304650 in __lwp_park()
   t@14 a l@14 ldap_int_thread_pool_wrapper() sleep on 0xffffffff7f304650 in __lwp_park()
   t@15 a l@15 ldap_int_thread_pool_wrapper() LWP suspended in _ti_pthread_mutex_lock()
   t@16 a l@16 ldap_int_thread_pool_wrapper() sleep on 0x10053dcb0 in __lwp_park()
   t@17 a l@17 ldap_int_thread_pool_wrapper() LWP suspended in ordered_value_validate()
   t@18 a l@18 ldap_int_thread_pool_wrapper() LWP suspended in copy_pattern()
(dbx) where
current thread: t@4
  [1] _ti_pthread_mutex_unlock(0x10053df90, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xffffffff7dd1493c
=>[2] ldap_pvt_thread_mutex_unlock(mutex = 0x10053df90), line 307 in "thr_posix.c"
  [3] ldap_back_getconn(op = 0x1005d0990, rs = 0xffffffff78bff998, sendok = 18, binddn = 0xffffffff78bfe5e8, bindcred = 0xffffffff78bfe5d8), line 912 in "bind.c"
  [4] ldap_back_dobind_int(lcp = 0xffffffff78bfe8f8, op = 0x1005d0990, rs = 0xffffffff78bff998, sendok = 18, retries = 0, dolock = 1), line 1267 in "bind.c"
  [5] ldap_back_dobind(lcp = 0xffffffff78bfe8f8, op = 0x1005d0990, rs = 0xffffffff78bff998, sendok = LDAP_BACK_SENDERR), line 1508 in "bind.c"
  [6] ldap_back_search(op = 0x1005d0990, rs = 0xffffffff78bff998), line 166 in "search.c"
  [7] glue_sub_search(op = 0x1005d0990, rs = 0xffffffff78bff998, b0 = 0xffffffff78bfeda8, on = 0x1005389e0), line 340 in "backglue.c"
  [8] glue_op_search(op = 0x1005d0990, rs = 0xffffffff78bff998), line 452 in "backglue.c"
  [9] overlay_op_walk(op = 0x1005d0990, rs = 0xffffffff78bff998, which = op_search, oi = 0x100538800, on = 0x1005389e0), line 642 in "backover.c"
  [10] over_op_func(op = 0x1005d0990, rs = 0xffffffff78bff998, which = op_search), line 704 in "backover.c"
  [11] over_op_search(op = 0x1005d0990, rs = 0xffffffff78bff998), line 726 in "backover.c"
  [12] fe_op_search(op = 0x1005d0990, rs = 0xffffffff78bff998), line 368 in "search.c"
  [13] overlay_op_walk(op = 0x1005d0990, rs = 0xffffffff78bff998, which = op_search, oi = 0x10053c6b0, on = (nil)), line 652 in "backover.c"
  [14] over_op_func(op = 0x1005d0990, rs = 0xffffffff78bff998, which = op_search), line 704 in "backover.c"
  [15] over_op_search(op = 0x1005d0990, rs = 0xffffffff78bff998), line 726 in "backover.c"
  [16] do_search(op = 0x1005d0990, rs = 0xffffffff78bff998), line 217 in "search.c"
  [17] connection_operation(ctx = 0xffffffff78bffc28, arg_v = 0x1005d0990), line 1083 in "connection.c"
  [18] connection_read_thread(ctx = 0xffffffff78bffc28, argv = 0xd), line 1210 in "connection.c"
  [19] ldap_int_thread_pool_wrapper(xpool = 0x1004f2ad0), line 625 in "tpool.c"
I can tar up the testrun directory (or the entire source tree) if interested.
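To read the `threads` listing above at a glance, it can help to group the threads by what they are doing. The following is a rough sketch, assuming the column layout seen in this one listing (thread id, state flag, LWP id, start function, state, current function). The regular expression and the name `group_by_state` are illustrative and are not a dbx feature:

```python
import re
from collections import defaultdict

# A few lines copied from the `threads` listing in the message above.
SAMPLE = """\
t@2 a l@2 slapd_daemon_task() LWP suspended in _libc_poll()
o> t@4 a l@4 ldap_int_thread_pool_wrapper() signal SIGSEGV in _ti_pthread_mutex_unlock()
t@5 a l@5 ldap_int_thread_pool_wrapper() sleep on 0xffffffff7f304650 in __lwp_park()
t@6 a l@6 ldap_int_thread_pool_wrapper() sleep on 0xffffffff7f304650 in __lwp_park()
t@11 a l@11 ldap_int_thread_pool_wrapper() sleep on 0x10056ac10 in __lwp_park()
t@15 a l@15 ldap_int_thread_pool_wrapper() LWP suspended in _ti_pthread_mutex_lock()
"""

# Assumed layout (inferred from this listing only):
# thread id, flag, LWP id, start function, state text, "in", current function.
LINE = re.compile(
    r"(t@\d+)\s+\S+\s+l@\d+\s+(\S+\(\))\s+(.*?)\s+in\s+(\S+\(\))\s*$"
)


def group_by_state(listing):
    """Map (state, current function) to the list of thread ids in that state."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        match = LINE.search(line)
        if match:
            thread_id, _start_func, state, current_func = match.groups()
            groups[(state, current_func)].append(thread_id)
    return dict(groups)


if __name__ == "__main__":
    for (state, func), threads in group_by_state(SAMPLE).items():
        print(f"{state} in {func}: {', '.join(threads)}")
```

On the sample lines this leaves t@4 alone in the SIGSEGV group and collects the workers parked on the same address, which may simply be idle pool threads.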
Aaron Richton wrote:
test039 -b bdb, under watchmalloc so this in theory is The Actual Instruction:
I can tar up the testrun directory (or the entire source tree) if interested.
What OS and compiler version? Can you rule out hardware errors here? Compiler optimization bugs? This trace shows an actual SEGV in back-ldap, whereas your previous testrun directory doesn't show any faults.
Your previous one shows a successful Add of an entry (James A Jones 3) but subsequent references to it get No Such Object. That implies an index corruption, but the actual database files look ok. Also the offending entry is present in the final ldapsearch output, so really the DB seems consistent.
Any problem in back-bdb should have turned up in test008 first.
What OS and compiler version? Can you rule out hardware errors here? Compiler optimization bugs? This trace shows an actual SEGV in back-ldap, whereas your previous testrun directory doesn't show any faults.
Solaris 9, Sun Studio 12, fully patched. I don't optimize OpenLDAP. CFLAGS="-g -xs -KPIC -xarch=v9". It's the same hardware I've been using for OpenLDAP testing for years. (Obviously compilers/patch levels/etc. change over time, but still...). And I have had other segv's with this checkout, e.g. earlier:
I had one core dump overnight, but I couldn't reproduce it under a memory debugger, so I just decided to run the whole suite again.
2.3.39 tested clean on the same box on Nov 8, 2.3.38 worked on Oct 5, 2.3.37 worked on Jul 30, RE24 (prior to 2.4.6) worked on October 23, etc...
I was thinking about this -- watchmalloc dying in a mutex unlock. Is there a chance that it's unlocking something it doesn't really hold? I vaguely remember this being a "feature" of Sun's implementation, but it's been a while...
Any problem in back-bdb should have turned up in test008 first.
Fair enough. I'll try stressing test008.
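Stressing a single test, as proposed here for test008, amounts to running it many times and counting how often it fails. The following is a minimal sketch, assuming the test can be started as one command that exits non-zero on failure. The helper name `stress` and the placeholder command are illustrative and are not part of the OpenLDAP test suite:

```python
import subprocess
import sys
from collections import Counter


def stress(cmd, runs):
    """Run cmd `runs` times and tally the exit codes it returned."""
    tally = Counter()
    for _ in range(runs):
        # Output is discarded here; a real run would want to keep the
        # testrun directory and any core files from each failing iteration.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        tally[result.returncode] += 1
    return tally


if __name__ == "__main__":
    # Placeholder command (an assumption): it always succeeds. Replace it
    # with whatever invokes the test script on the local build.
    placeholder = [sys.executable, "-c", "raise SystemExit(0)"]
    for code, count in sorted(stress(placeholder, 3).items()):
        print(f"exit {code}: {count} run(s)")
```

With an intermittent failure, the tally separates "fails every time" from "fails once in N runs", which is the distinction this thread keeps running into.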
Aaron Richton wrote:
What OS and compiler version? Can you rule out hardware errors here? Compiler optimization bugs? This trace shows an actual SEGV in back-ldap, whereas your previous testrun directory doesn't show any faults.
Solaris 9, Sun Studio 12, fully patched. I don't optimize OpenLDAP. CFLAGS="-g -xs -KPIC -xarch=v9". It's the same hardware I've been using for OpenLDAP testing for years. (Obviously compilers/patch levels/etc. change over time, but still...). And I have had other segv's with this checkout, e.g. earlier:
I had one core dump overnight, but I couldn't reproduce it under a memory debugger, so I just decided to run the whole suite again.
2.3.39 tested clean on the same box on Nov 8, 2.3.38 worked on Oct 5, 2.3.37 worked on Jul 30, RE24 (prior to 2.4.6) worked on October 23, etc...
I was thinking about this -- watchmalloc dying in a mutex unlock. Is there a chance that it's unlocking something it doesn't really hold? I vaguely remember this being a "feature" of Sun's implementation, but it's been a while...
I don't think that results in a SEGV though. Also, there were no changes to back-ldap between 2.4.6 and current RE24. There were no changes to back-ldap between October 23 2007 and now. So to get a SEGV here now implies that something outside the OpenLDAP source has changed.
Any problem in back-bdb should have turned up in test008 first.
Fair enough. I'll try stressing test008.
Took a new (ordered integers fixed) checkout. At this point I'm reproducibly failing test021. I put up a failed testrun directory at
https://www.nbcs.rutgers.edu/~richton/richton-20071130-test021fail.tgz
Aaron Richton wrote:
Took a new (ordered integers fixed) checkout. At this point I'm reproducibly failing test021. I put up a failed testrun directory at
https://www.nbcs.rutgers.edu/~richton/richton-20071130-test021fail.tgz
Argh. Thanks, fixed now in HEAD.