test008 failed in HEAD

RCS file: /repo/OpenLDAP/pkg/ldap/servers/slapd/back-bdb/delete.c,v retrieving revision 1.162 diff -u -r1.162 delete.c --- delete.c 28 Aug 2007 22:13:56 -0000 1.162 +++ delete.c 3 Dec 2007 15:27:19 -0000 @@ -199,6 +199,8 @@ "<=- " LDAP_XSTRING(bdb_delete) ": no such object %s\n", op->o_req_dn.bv_val, 0, 0);

+ assert(0); + if ( matched != NULL ) { rs->sr_matched = ch_strdup( matched->e_dn ); rs->sr_ref = is_entry_referral( matched )

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Howard Chu

8:02 a.m.

Howard Chu wrote:

...

Hallvard B Furuseth wrote:

...
http://folk.uio.no/hbf/test008out2.tbz ldapsearch.flt = ldif.flt + cn=James A Jones 4,ou=People,dc=example,dc=com .

I just got a similar failure in RE24. Looks like the same thing Aaron reported. Too bad it takes so long to reproduce it.

I'm using this assert to try to catch it in the act:

RCS file: /repo/OpenLDAP/pkg/ldap/servers/slapd/back-bdb/delete.c,v retrieving revision 1.162 diff -u -r1.162 delete.c --- delete.c 28 Aug 2007 22:13:56 -0000 1.162 +++ delete.c 3 Dec 2007 15:27:19 -0000 @@ -199,6 +199,8 @@ "<=- " LDAP_XSTRING(bdb_delete) ": no such object %s\n", op->o_req_dn.bv_val, 0, 0);
          assert(0);
           if ( matched != NULL ) {
                   rs->sr_matched = ch_strdup( matched->e_dn );
                   rs->sr_ref = is_entry_referral( matched )

It took about 10 iterations of test008 to trip it again. Still looking at the core file, the EntryInfo for the target entry but it looks like it couldn't be found in the id2entry DB. Odd.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Howard Chu

9:53 a.m.

Howard Chu wrote:

...

Howard Chu wrote:

...
Hallvard B Furuseth wrote:

...
http://folk.uio.no/hbf/test008out2.tbz ldapsearch.flt = ldif.flt + cn=James A Jones 4,ou=People,dc=example,dc=com .

I just got a similar failure in RE24. Looks like the same thing Aaron reported. Too bad it takes so long to reproduce it.

I'm using this assert to try to catch it in the act:

RCS file: /repo/OpenLDAP/pkg/ldap/servers/slapd/back-bdb/delete.c,v retrieving revision 1.162 diff -u -r1.162 delete.c --- delete.c 28 Aug 2007 22:13:56 -0000 1.162 +++ delete.c 3 Dec 2007 15:27:19 -0000 @@ -199,6 +199,8 @@ "<=- " LDAP_XSTRING(bdb_delete) ": no such object %s\n", op->o_req_dn.bv_val, 0, 0);
          assert(0);
           if ( matched != NULL ) {
                   rs->sr_matched = ch_strdup( matched->e_dn );
                   rs->sr_ref = is_entry_referral( matched )
It took about 10 iterations of test008 to trip it again. Still looking at the core file, the EntryInfo for the target entry but it looks like it couldn't be found in the id2entry DB. Odd.

Very strange. The entry could not be found because cache_find_ndn returned an EntryInfo for the previous incarnation of the entry, which contained the old entryID. This is extremely odd given that bdb_cache_delete_internal has already successfully deleted the old record from the cache, before the previous Delete completed. On the subsequent Add, bdb_cache_add fails to insert the new EntryInfo because it has the same RDN as the old.

On my tests it almost always fails with the new entryID == 84 and the previous == 83. That's also pretty suspicious, no idea what it means.

Still sprinkling asserts in the code to try to find what's going on.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Howard Chu

4 Dec 4 Dec

9:25 a.m.

Howard Chu wrote:

...

Howard Chu wrote:

...
Howard Chu wrote:

...
Hallvard B Furuseth wrote:

...
http://folk.uio.no/hbf/test008out2.tbz ldapsearch.flt = ldif.flt + cn=James A Jones 4,ou=People,dc=example,dc=com .

I just got a similar failure in RE24. Looks like the same thing Aaron reported. Too bad it takes so long to reproduce it.

...

Very strange. The entry could not be found because cache_find_ndn returned an EntryInfo for the previous incarnation of the entry, which contained the old entryID. This is extremely odd given that bdb_cache_delete_internal has already successfully deleted the old record from the cache, before the previous Delete completed. On the subsequent Add, bdb_cache_add fails to insert the new EntryInfo because it has the same RDN as the old.

Stranger and stranger. It looks like BerkeleyDB's transaction locking isn't working. The problem is occurring because BDB is returning the info for an entryID that is being deleted, but whose Delete transaction hasn't yet committed.

Here's a sequence of events from slapd.1.log:

conn=5 op=66 DEL dn="cn=James A Jones 2,ou=Alumni Association,ou=People,dc=example,dc=com" bdb_dn2entry("cn=james a jones 2,ou=alumni association,ou=people,dc=example,dc=com") ==> bdb_delete: cn=James A Jones 2,ou=Alumni Association,ou=People,dc=example,dc=com bdb_dn2entry("cn=james a jones 2,ou=alumni association,ou=people,dc=example,dc=com") => bdb_dn2id_delete( "cn=james a jones 2,ou=alumni association,ou=people,dc=example,dc=com", 0x0000006b ) ... bdb_idl_delete_key: 6b @cn=james a jones 2,ou=alumni association,ou=people,dc=example,dc=com bdb_idl_delete_key: 6b %ou=alumni association,ou=people,dc=example,dc=com bdb_idl_delete_key: 6b @ou=alumni association,ou=people,dc=example,dc=com <= bdb_index_read: failed (-30990) bdb_idl_delete_key: 6b @ou=people,dc=example,dc=com <= bdb_dn2id_delete 0 => index_entry_del( 107, "cn=James A Jones 2,ou=Alumni Association,ou=People,dc=example,dc=com" ) <= index_entry_del( 107, "cn=James A Jones 2,ou=Alumni Association,ou=People,dc=example,dc=com" ) success ====> bdb_cache_delete( 107 ) ...

bdb_idl_fetch_key: [b49d1940] <= bdb_substring_candidates: 3, first=12, last=107 bdb_search_candidates: id=3 first=12 last=107 => send_search_entry: conn 28 dn="cn=James A Jones 1,ou=Alumni Association,ou=People,dc=example,dc=com" ber_flush2: 146 bytes to sd 20 <= send_search_entry: conn 28 exit. => send_search_entry: conn 28 dn="cn=James A Jones 2,ou=Information Technology Division,ou=People,dc=example,dc=com" ber_flush2: 157 bytes to sd 20 <= send_search_entry: conn 28 exit. add_internal "cn=james a jones 2,ou=alumni association,ou=people,dc=example,dc=com", 107 ...

bdb_delete: deleted id=0000006b dn="cn=James A Jones 2,ou=Alumni Association,ou=People,dc=example,dc=com" send_ldap_result: conn=5 op=66 p=3

The time between the "bdb_cache_delete(107)" and the "bdb_delete: deleted..." is all the txn_commit time. In between there you see that an IDL search returned a result including entryID 107, even though all of the index IDLs for that entry were already deleted. This should never happen; all of the relevant IDL and dn2id pages should still be writelocked. There's also no Deadlocks within this interval of time, so nothing aborted midstream.

I'm seeing this failure with both BDB 4.2.52 and BDB 4.6.21, and no failure in RE23, so it must be something changed between RE23 and RE24, but so far looking at the diffs I don't see anything that could have caused this change in behavior.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Hallvard B Furuseth

10:44 a.m.

If you are using gcc, it might help to compile BDB and/or OpenLDAP without -O and see if that helps. The gcc mailinglist and comp.programming.threads recently discussed a some gcc optimizations that are invalid for threaded code. E.g. gcc can turn things like

<static or extern> int v; ... if (set_v()) v += 1; into flag1 = (set_v() != 0); reg2 = v; reg2 += flag1; v = reg2;

If set_v() does trylock(lock protecting v) and returns true at success, this breaks thread safety when set_v() fails.

The optimization is valid as far as the C standard is concerned, but POSIX forbids it for threaded code. Otherwise one would have to sprinkle 'volatile' all over a threaded program, to protect it from the compiler.

-- Regards, Hallvard

Aaron Richton

11:20 a.m.

A valid suggestion, but I'll note that I'm seeing similar failures with Sun Studio on sparc and no optmization. That's pretty far off from gcc...

On Tue, 4 Dec 2007, Hallvard B Furuseth wrote:

...

If you are using gcc, it might help to compile BDB and/or OpenLDAP without -O and see if that helps. The gcc mailinglist and comp.programming.threads recently discussed a some gcc optimizations that are invalid for threaded code. E.g. gcc can turn things like

<static or extern> int v; ... if (set_v()) v += 1; into flag1 = (set_v() != 0); reg2 = v; reg2 += flag1; v = reg2;

If set_v() does trylock(lock protecting v) and returns true at success, this breaks thread safety when set_v() fails.

The optimization is valid as far as the C standard is concerned, but POSIX forbids it for threaded code. Otherwise one would have to sprinkle 'volatile' all over a threaded program, to protect it from the compiler.

-- Regards, Hallvard

Howard Chu

11:31 a.m.

Hallvard B Furuseth wrote:

...

If you are using gcc, it might help to compile BDB and/or OpenLDAP without -O and see if that helps. The gcc mailinglist and comp.programming.threads recently discussed a some gcc optimizations that are invalid for threaded code. E.g. gcc can turn things like

I was hoping it wouldn't come to this. Also, as I recall, Aaron was seeing a similar problem, and he uses the Sun compiler.

Anyway, compiling both BDB and back-bdb without -O didn't prevent the problem.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Howard Chu

1:36 p.m.

Howard Chu wrote:

...

I'm seeing this failure with both BDB 4.2.52 and BDB 4.6.21, and no failure in RE23, so it must be something changed between RE23 and RE24, but so far looking at the diffs I don't see anything that could have caused this change in behavior.

Looking back at 2.4.5 and 2.4.6, I got the same failure in 2.4.6. That means it's not a new regression in 2.4.7 at least. I'm still testing 2.4.5 (using the tests in 2.4.6) to see if it fails. At this point, since it's not new in 2.4.7, we should track it in the ITS. I may just hack up a quick workaround for this until I can isolate the real problem.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Howard Chu

5:04 p.m.

Howard Chu wrote:

...

Howard Chu wrote:

...
I'm seeing this failure with both BDB 4.2.52 and BDB 4.6.21, and no failure in RE23, so it must be something changed between RE23 and RE24, but so far looking at the diffs I don't see anything that could have caused this change in behavior.

Looking back at 2.4.5 and 2.4.6, I got the same failure in 2.4.6. That means it's not a new regression in 2.4.7 at least. I'm still testing 2.4.5 (using the tests in 2.4.6) to see if it fails. At this point, since it's not new in 2.4.7, we should track it in the ITS. I may just hack up a quick workaround for this until I can isolate the real problem.

The back-bdb diffs between 2.4.5 and 2.4.6 look totally trivial to me. However, reverting one change seems to have cured the problems in my 2.4.6 and 2.4.7 trees. Why it makes any difference is completely beyond me.

I'll let my 2.4.7 build keep running test008 for a few more hours. I'd appreciate if you could test this as well.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/ Index: back-bdb.h =================================================================== RCS file: /repo/OpenLDAP/pkg/ldap/servers/slapd/back-bdb/back-bdb.h,v retrieving revision 1.141.2.8 diff -u -r1.141.2.8 back-bdb.h --- back-bdb.h 26 Nov 2007 00:33:12 -0000 1.141.2.8 +++ back-bdb.h 5 Dec 2007 01:01:30 -0000 @@ -151,10 +151,10 @@ EntryInfo *c_lruhead; /* lru - add accessed entries here */ EntryInfo *c_lrutail; /* lru - rem lru entries from here */ EntryInfo c_dntree; - unsigned c_maxsize; + int c_maxsize; int c_cursize; - unsigned c_minfree; - unsigned c_eimax; + int c_minfree; + int c_eimax; int c_eiused; /* EntryInfo's in use */ int c_leaves; /* EntryInfo leaf nodes */ int c_purging; @@ -223,7 +223,7 @@ ID bi_lastid; ldap_pvt_thread_mutex_t bi_lastid_mutex; - unsigned bi_idl_cache_max_size; + int bi_idl_cache_max_size; int bi_idl_cache_size; Avlnode *bi_idl_tree; bdb_idl_cache_entry_t *bi_idl_lru_head;

Howard Chu

5:45 p.m.

Howard Chu wrote:

...

Howard Chu wrote:

...
Howard Chu wrote:

...
I'm seeing this failure with both BDB 4.2.52 and BDB 4.6.21, and no failure in RE23, so it must be something changed between RE23 and RE24, but so far looking at the diffs I don't see anything that could have caused this change in behavior.

Looking back at 2.4.5 and 2.4.6, I got the same failure in 2.4.6. That means it's not a new regression in 2.4.7 at least. I'm still testing 2.4.5 (using the tests in 2.4.6) to see if it fails. At this point, since it's not new in 2.4.7, we should track it in the ITS. I may just hack up a quick workaround for this until I can isolate the real problem.

The back-bdb diffs between 2.4.5 and 2.4.6 look totally trivial to me. However, reverting one change seems to have cured the problems in my 2.4.6 and 2.4.7 trees. Why it makes any difference is completely beyond me.

I'll let my 2.4.7 build keep running test008 for a few more hours. I'd appreciate if you could test this as well.

Never mind. It eventually failed in 2.4.7 after 35 iterations. I'm back to using the 2.4.7 tests against the 2.4.5 slapd again.

As an aside - anyone else seeing this issue, are you running on a single core or multiple cores?

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Aaron Richton

5 Dec 5 Dec

4:44 a.m.

Dual cpus, each single core.

cpu1: UltraSPARC-IIIi (portid 1 impl 0x16 ver 0x24 clock 1002 MHz) cpu0: UltraSPARC-IIIi (portid 0 impl 0x16 ver 0x24 clock 1002 MHz)

Hallvard B Furuseth

6:09 a.m.

2 Quad Core CPUs. Intel Xeon E5335. 2GHz.

-- Hallvard

Howard Chu

11:18 a.m.

Hallvard B Furuseth wrote:

...

2 Quad Core CPUs. Intel Xeon E5335. 2GHz.

The code in HEAD has passed 560 iterations on my quad-socket server and is still going. It failed after 309 iterations on my single-socket dual-core machine. On the advice of the BDB folks I've added some more logging inside the BDB library to try to track it further.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Aaron Richton

6 Dec 6 Dec

2:46 p.m.

Given recent checkins, HEAD bdb test008 has ran successfully all day in an infinite loop. Will let hdb run overnight just in case it's different...

On Wed, 5 Dec 2007, Howard Chu wrote:

...

Hallvard B Furuseth wrote:

...
2 Quad Core CPUs. Intel Xeon E5335. 2GHz.

The code in HEAD has passed 560 iterations on my quad-socket server and is still going. It failed after 309 iterations on my single-socket dual-core machine. On the advice of the BDB folks I've added some more logging inside the BDB library to try to track it further. -- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Howard Chu

6:06 p.m.

Aaron Richton wrote:

...

Given recent checkins, HEAD bdb test008 has ran successfully all day in an infinite loop. Will let hdb run overnight just in case it's different...

Thanks. Yes, hdb is slightly different but not as far as test008 is concerned. I had a problem in test020 which showed a regression that's been there ever since 2.3 was released, but that was also fixed with the last round of checkins. (hdb had a race that could allow multiple entries with the same DN to be created. Never happened in 2.2.)

...

On Wed, 5 Dec 2007, Howard Chu wrote:

...
Hallvard B Furuseth wrote:

...
2 Quad Core CPUs. Intel Xeon E5335. 2GHz.

The code in HEAD has passed 560 iterations on my quad-socket server and is still going. It failed after 309 iterations on my single-socket dual-core machine. On the advice of the BDB folks I've added some more logging inside the BDB library to try to track it further.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Michael Ströder

7 Dec 7 Dec

3:44 a.m.

Howard Chu wrote:

...

Thanks. Yes, hdb is slightly different but not as far as test008 is concerned. I had a problem in test020 which showed a regression that's been there ever since 2.3 was released, but that was also fixed with the last round of checkins. (hdb had a race that could allow multiple entries with the same DN to be created. Never happened in 2.2.)

Will these fixes also be applied to RE23 (2.3.40)?

Ciao, Michael.

Howard Chu

8:14 a.m.

Michael Ströder wrote:

...

Howard Chu wrote:

...
Thanks. Yes, hdb is slightly different but not as far as test008 is concerned. I had a problem in test020 which showed a regression that's been there ever since 2.3 was released, but that was also fixed with the last round of checkins. (hdb had a race that could allow multiple entries with the same DN to be created. Never happened in 2.2.)

Will these fixes also be applied to RE23 (2.3.40)?

Are you not receiving the commit emails? They already were.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Michael Ströder

8:21 a.m.

Howard Chu wrote:

...

Michael Ströder wrote:

...
Howard Chu wrote:

...
Thanks. Yes, hdb is slightly different but not as far as test008 is concerned. I had a problem in test020 which showed a regression that's been there ever since 2.3 was released, but that was also fixed with the last round of checkins. (hdb had a race that could allow multiple entries with the same DN to be created. Never happened in 2.2.)

Will these fixes also be applied to RE23 (2.3.40)?

Are you not receiving the commit emails?

Nope.

...

They already were.

Ok, synced RE24 and RE23 and saw it also in CHANGES.

Ciao, Michael.

Howard Chu

4 Dec 4 Dec

6:29 p.m.

Howard Chu wrote:

...

Howard Chu wrote:

...
I'm seeing this failure with both BDB 4.2.52 and BDB 4.6.21, and no failure in RE23, so it must be something changed between RE23 and RE24, but so far looking at the diffs I don't see anything that could have caused this change in behavior.

Looking back at 2.4.5 and 2.4.6, I got the same failure in 2.4.6. That means it's not a new regression in 2.4.7 at least. I'm still testing 2.4.5 (using the tests in 2.4.6) to see if it fails. At this point, since it's not new in 2.4.7, we should track it in the ITS. I may just hack up a quick workaround for this until I can isolate the real problem.

Using the 2.4.7 tests, with 3 more searches added to the mix, I got the same failure in 2.4.5 and RE23. So it seems this is really a much older bug. Not sure what to do about it at the moment. Again, we've been relying on BerkeleyDB's lock subsystem to prevent this sort of event from ever happening, and clearly that's not working. I'll send a query to the BerkeleyDB folks about it.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

6415

Age (days ago)

6419

Last active (days ago)

openldap-devel@openldap.org

19 comments

4 participants

tags (0)

participants (4)

Aaron Richton
Hallvard B Furuseth
Howard Chu
Michael Ströder