Please test RE24 as it is being considered for 2.4.16. Thanks!
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
On Mar 10, 2009, at 00:21 , Quanah Gibson-Mount wrote:
Please test RE24 as it is being considered for 2.4.16. Thanks!
On OS X 10.5.6/Intel test045 fails:
Starting test045-syncreplication-proxied ...
running defines.sh Starting master slapd on TCP/IP port 9011... Using ldapsearch to check that master slapd is running... Using ldapadd to create the context prefix entry in the master... Starting slave slapd on TCP/IP port 9012... Using ldapsearch to check that slave slapd is running... Starting proxy slapd on TCP/IP port 9013... Using ldapsearch to check that proxy slapd is running... 1 > Using ldapadd to populate the master directory... 1 < Comparing retrieved entries from master and slave... 2 > Stopping the provider, sleeping 15 seconds and restarting it... Using ldapsearch to check that master slapd is running... Using ldapmodify to modify master directory... Waiting 7 seconds for syncrepl to receive changes... Waiting 7 seconds for syncrepl to receive changes... Waiting 7 seconds for syncrepl to receive changes... 2 < Comparing retrieved entries from master and slave... test failed - master and slave databases differ
./scripts/test045-syncreplication-proxied failed (exit 1)
make[2]: *** [bdb-yes] Error 1 make[1]: *** [test] Error 2 make: *** [test] Error 2
The attached diff contains the differences between the output files server1.out.2 and server2.out.2, which is the same difference found between server1.flt and server2.flt.
jens
Jens Vagelpohl wrote:
On Mar 10, 2009, at 00:21 , Quanah Gibson-Mount wrote:
Please test RE24 as it is being considered for 2.4.16. Thanks!
On OS X 10.5.6/Intel test045 fails:
The attached diff contains the differences between the output files server1.out.2 and server2.out.2, which is the same difference found between server1.flt and server2.flt.
That's not much information, probably better tar up the testrun directory.
--On Tuesday, March 10, 2009 9:48 AM +0100 Jens Vagelpohl jens@dataflake.org wrote:
On Mar 10, 2009, at 00:21 , Quanah Gibson-Mount wrote:
Please test RE24 as it is being considered for 2.4.16. Thanks!
On OS X 10.5.6/Intel test045 fails:
This should be fixed now, please test further. :)
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
--On Tuesday, March 10, 2009 9:48 AM +0100 Jens Vagelpohl jens@dataflake.org wrote:
On Mar 10, 2009, at 00:21 , Quanah Gibson-Mount wrote:
Please test RE24 as it is being considered for 2.4.16. Thanks!
On OS X 10.5.6/Intel test045 fails:
This should be fixed now, please test further. :)
This test still fails:
Running ./scripts/test045-syncreplication-proxied... running defines.sh Starting master slapd on TCP/IP port 9011... Using ldapsearch to check that master slapd is running... Using ldapadd to create the context prefix entry in the master... Starting slave slapd on TCP/IP port 9012... Using ldapsearch to check that slave slapd is running... Starting proxy slapd on TCP/IP port 9013... Using ldapsearch to check that proxy slapd is running... 1 > Using ldapadd to populate the master directory... 1 < Comparing retrieved entries from master and slave... 2 > Stopping the provider, sleeping 15 seconds and restarting it... Using ldapsearch to check that master slapd is running... Using ldapmodify to modify master directory... Waiting 7 seconds for syncrepl to receive changes... Waiting 7 seconds for syncrepl to receive changes... Waiting 7 seconds for syncrepl to receive changes... 2 < Comparing retrieved entries from master and slave... test failed - master and slave databases differ
Ciao, Michael.
Michael Ströder wrote:
Quanah Gibson-Mount wrote:
--On Tuesday, March 10, 2009 9:48 AM +0100 Jens Vagelpohl jens@dataflake.org wrote:
On Mar 10, 2009, at 00:21 , Quanah Gibson-Mount wrote:
Please test RE24 as it is being considered for 2.4.16. Thanks!
On OS X 10.5.6/Intel test045 fails:
This should be fixed now, please test further. :)
This test still fails:
Same here. If I diff the output I get:
diff -u testrun/server1.out.2 testrun/server2.out.2 --- testrun/server1.out.2 2009-03-10 19:35:04.000000000 +0100 +++ testrun/server2.out.2 2009-03-10 19:35:04.000000000 +0100 @@ -61,11 +61,13 @@ =com objectClass: OpenLDAPperson cn: Bjorn Jensen +cn: Biiff Jensen sn: Jensen uid: bjorn seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com userPassword:: Ympvcm4= homePostalAddress: 19923 Seven Mile Rd. $ South Lyon, MI 49999 +drink: Iced Tea description: Hiker, biker title: Director, Embedded Systems postalAddress: Info Tech Division $ 535 W. William St. $ Anytown, MI 48103 @@ -74,7 +76,6 @@ pager: +1 313 555 4474 facsimileTelephoneNumber: +1 313 555 2177 telephoneNumber: +1 313 555 0355 -drink: Iced Tea
dn: cn=Dorothy Stevens,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson @@ -108,23 +109,6 @@ telephoneNumber: +1 313 555 1817 associatedDomain: example.com
-dn: cn=Gern Jensen,ou=Information Technology Division,ou=People,dc=example,dc= - com -objectClass: OpenLDAPperson -cn: Gern Jensen -sn: Jensen -uid: gjensen -title: Chief Investigator, ITD -postalAddress: ITD $ 535 W. William St $ Ann Arbor, MI 48103 -seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com -drink: Coffee -homePostalAddress: 844 Brown St. Apt. 4 $ Ann Arbor, MI 48104 -description: Very odd -facsimileTelephoneNumber: +1 313 555 7557 -telephoneNumber: +1 313 555 8343 -mail: gjensen@mailgw.example.com -homePhone: +1 313 555 8844 - dn: ou=Groups,dc=example,dc=com objectClass: organizationalUnit ou: Groups @@ -278,18 +262,19 @@ cn: ITD Staff objectClass: groupOfUniqueNames uniqueMember: cn=Manager,dc=example,dc=com +uniqueMember: cn=Bjorn Jensen,ou=Information Technology Division,ou=People,dc= + example,dc=com +uniqueMember: cn=James A Jones 2,ou=Information Technology Division,ou=People, + dc=example,dc=com uniqueMember: cn=John Doe,ou=Information Technology Division,ou=People,dc=exam ple,dc=com -uniqueMember: cn=Dorothy Stevens,ou=Alumni Association,ou=People,dc=example,dc - =com -uniqueMember: cn=James A Jones 1,ou=Alumni Association,ou=People,dc=example,dc - =com
dn: cn=James A Jones 1,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson cn: James A Jones 1 cn: James Jones cn: Jim Jones +sn: Jones uid: jaj postalAddress: Alumni Association $ 111 Maple St $ Anytown, MI 48109 seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com @@ -302,8 +287,25 @@ mail: jaj@mail.alumni.example.com facsimileTelephoneNumber: +1 313 555 4332 telephoneNumber: +1 313 555 0895 -drink: Orange Juice -sn: Jones + +dn: cn=James A Jones 2,ou=Information Technology Division,ou=People,dc=example + ,dc=com +objectClass: OpenLDAPperson +cn: James A Jones 2 +cn: James Jones +cn: Jim Jones +sn: Doe +uid: jjones +seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com +homePostalAddress: 933 Brooks $ Anytown, MI 48104 +homePhone: +1 313 555 8838 +title: Senior Manager, Information Technology Division +description: Not around very much +mail: jjones@mailgw.example.com +postalAddress: Info Tech Division $ 535 W William $ Anytown, MI 48103 +pager: +1 313 555 2833 +facsimileTelephoneNumber: +1 313 555 8688 +telephoneNumber: +1 313 555 7334
dn: cn=Jane Doe,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson @@ -387,17 +389,6 @@ objectClass: organizationalUnit ou: People
-dn: ou=Retired,ou=People,dc=example,dc=com -objectClass: organizationalUnit -ou: Retired - -dn: cn=Rosco P. Coltrane,ou=Retired,ou=People,dc=example,dc=com -objectClass: OpenLDAPperson -sn: Coltrane -uid: rosco -description: Fat tycoon -cn: Rosco P. Coltrane - dn: dc=testdomain1,dc=example,dc=com objectClass: domain dc: testdomain1
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it ----------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Fax: +39 0382 476497 Email: ando@sys-net.it -----------------------------------
--On Tuesday, March 10, 2009 8:54 PM +0100 Pierangelo Masarati ando@sys-net.it wrote:
Michael Ströder wrote:
Quanah Gibson-Mount wrote:
--On Tuesday, March 10, 2009 9:48 AM +0100 Jens Vagelpohl jens@dataflake.org wrote:
On Mar 10, 2009, at 00:21 , Quanah Gibson-Mount wrote:
Please test RE24 as it is being considered for 2.4.16. Thanks!
On OS X 10.5.6/Intel test045 fails:
This should be fixed now, please test further. :)
This test still fails:
Same here. If I diff the output I get:
test050 is also having interesting failures for me. I've had one segfault (corrupt CSN data), one lockup (waiting on unlocked mutexes to clear) and twice I've had server3 be empty:
-rw------- 1 quanah quanah 19849 Mar 10 14:31 server4.out -rw------- 1 quanah quanah 19849 Mar 10 14:31 server4.flt -rw------- 1 quanah quanah 377 Mar 10 14:31 server3.out -rw------- 1 quanah quanah 377 Mar 10 14:31 server3.flt -rw------- 1 quanah quanah 19849 Mar 10 14:31 server2.out -rw------- 1 quanah quanah 19849 Mar 10 14:31 server2.flt -rw------- 1 quanah quanah 19849 Mar 10 14:31 server1.out -rw------- 1 quanah quanah 19849 Mar 10 14:31 server1.flt
so the comparison fails. This is after adjusting SLEEP1 to be 20 seconds and SLEEP2 to be 30 seconds, which should be more than sufficient.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Pierangelo Masarati wrote:
Michael Ströder wrote:
Quanah Gibson-Mount wrote:
--On Tuesday, March 10, 2009 9:48 AM +0100 Jens Vagelpohl jens@dataflake.org wrote:
On Mar 10, 2009, at 00:21 , Quanah Gibson-Mount wrote:
Please test RE24 as it is being considered for 2.4.16. Thanks!
On OS X 10.5.6/Intel test045 fails:
This should be fixed now, please test further. :)
This test still fails:
Same here. If I diff the output I get:
Same diff as Jens. Basically whatever else is going on has run slowly enough that the proxy consumer has used up all of its 3 second retries and has switched into 300 second retries, which is useless within the scope of the test. I've now adjusted the retries parameter with more useful values.
diff -u testrun/server1.out.2 testrun/server2.out.2 --- testrun/server1.out.2 2009-03-10 19:35:04.000000000 +0100 +++ testrun/server2.out.2 2009-03-10 19:35:04.000000000 +0100 @@ -61,11 +61,13 @@ =com objectClass: OpenLDAPperson cn: Bjorn Jensen +cn: Biiff Jensen sn: Jensen uid: bjorn seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com userPassword:: Ympvcm4= homePostalAddress: 19923 Seven Mile Rd. $ South Lyon, MI 49999 +drink: Iced Tea description: Hiker, biker title: Director, Embedded Systems postalAddress: Info Tech Division $ 535 W. William St. $ Anytown, MI 48103 @@ -74,7 +76,6 @@ pager: +1 313 555 4474 facsimileTelephoneNumber: +1 313 555 2177 telephoneNumber: +1 313 555 0355 -drink: Iced Tea
dn: cn=Dorothy Stevens,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson @@ -108,23 +109,6 @@ telephoneNumber: +1 313 555 1817 associatedDomain: example.com
-dn: cn=Gern Jensen,ou=Information Technology Division,ou=People,dc=example,dc=
- com
-objectClass: OpenLDAPperson -cn: Gern Jensen -sn: Jensen -uid: gjensen -title: Chief Investigator, ITD -postalAddress: ITD $ 535 W. William St $ Ann Arbor, MI 48103 -seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com -drink: Coffee -homePostalAddress: 844 Brown St. Apt. 4 $ Ann Arbor, MI 48104 -description: Very odd -facsimileTelephoneNumber: +1 313 555 7557 -telephoneNumber: +1 313 555 8343 -mail: gjensen@mailgw.example.com -homePhone: +1 313 555 8844
- dn: ou=Groups,dc=example,dc=com objectClass: organizationalUnit ou: Groups
@@ -278,18 +262,19 @@ cn: ITD Staff objectClass: groupOfUniqueNames uniqueMember: cn=Manager,dc=example,dc=com +uniqueMember: cn=Bjorn Jensen,ou=Information Technology Division,ou=People,dc=
- example,dc=com
+uniqueMember: cn=James A Jones 2,ou=Information Technology Division,ou=People,
- dc=example,dc=com uniqueMember: cn=John Doe,ou=Information Technology
Division,ou=People,dc=exam ple,dc=com -uniqueMember: cn=Dorothy Stevens,ou=Alumni Association,ou=People,dc=example,dc
- =com
-uniqueMember: cn=James A Jones 1,ou=Alumni Association,ou=People,dc=example,dc
=com
dn: cn=James A Jones 1,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson cn: James A Jones 1 cn: James Jones cn: Jim Jones
+sn: Jones uid: jaj postalAddress: Alumni Association $ 111 Maple St $ Anytown, MI 48109 seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com @@ -302,8 +287,25 @@ mail: jaj@mail.alumni.example.com facsimileTelephoneNumber: +1 313 555 4332 telephoneNumber: +1 313 555 0895 -drink: Orange Juice -sn: Jones
+dn: cn=James A Jones 2,ou=Information Technology Division,ou=People,dc=example
- ,dc=com
+objectClass: OpenLDAPperson +cn: James A Jones 2 +cn: James Jones +cn: Jim Jones +sn: Doe +uid: jjones +seeAlso: cn=All Staff,ou=Groups,dc=example,dc=com +homePostalAddress: 933 Brooks $ Anytown, MI 48104 +homePhone: +1 313 555 8838 +title: Senior Manager, Information Technology Division +description: Not around very much +mail: jjones@mailgw.example.com +postalAddress: Info Tech Division $ 535 W William $ Anytown, MI 48103 +pager: +1 313 555 2833 +facsimileTelephoneNumber: +1 313 555 8688 +telephoneNumber: +1 313 555 7334
dn: cn=Jane Doe,ou=Alumni Association,ou=People,dc=example,dc=com objectClass: OpenLDAPperson @@ -387,17 +389,6 @@ objectClass: organizationalUnit ou: People
-dn: ou=Retired,ou=People,dc=example,dc=com -objectClass: organizationalUnit -ou: Retired
-dn: cn=Rosco P. Coltrane,ou=Retired,ou=People,dc=example,dc=com -objectClass: OpenLDAPperson -sn: Coltrane -uid: rosco -description: Fat tycoon -cn: Rosco P. Coltrane
- dn: dc=testdomain1,dc=example,dc=com objectClass: domain dc: testdomain1
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On Mar 11, 2009, at 02:13 , Howard Chu wrote:
Basically whatever else is going on has run slowly enough that the proxy consumer has used up all of its 3 second retries and has switched into 300 second retries, which is useless within the scope of the test. I've now adjusted the retries parameter with more useful values.
The tests are now running fine on Mac OS X 10.5.6/Intel. I did have one failure on test050, but was unable to reproduce it after the first run...
jens
test050 fails. Often.
https://www.nbcs.rutgers.edu/~richton/test050failures.tar.gz contains 12 bad testrun directories. If it was an assert() or SEGV, I should have core for a backtrace. I don't know if generating all of them ahead of time would be considered useless spam. Ehh...then again, it's probably not readily apparent from a testrun directory if it was a SEGV or not. I'll see what I can put together; in the interim, if there's a particular run that looks particularly interesting, let me know the timestamp (from the directory name) and I can get that trace out.
On Wed, 11 Mar 2009, Aaron Richton wrote:
https://www.nbcs.rutgers.edu/~richton/test050failures.tar.gz contains 12 bad
See also https://www.nbcs.rutgers.edu/~richton/test050backtraces.tar.gz; match to testrun directories by timestamp. There are six that I focused on that have both testrun AND core available.
Oh yeah, I compile with CHECK_CSN. Where there's smoke...
big picture:
==> backtrace.2009030049 <== two cores slapd #1:
t@8 (l@8) terminated by signal SEGV (no mapping at the fault address) 0xff2c8038: memchr+0x0010: ldub [%o0], %o0 Current function is slap_parse_csn_sid 129 p = ber_bvchr( &csn, '#' );
slapd #2:
t@3 (l@3) terminated by signal ABRT (Abort) 0xff31feb0: __lwp_kill+0x0008: bgeu,a __lwp_kill+0x1c ! 0xff31fec4 Current function is connection_state_closing 686 assert( c->c_struct_state == SLAP_C_USED );
==> backtrace.2009030126 <== t@4 (l@4) terminated by signal ABRT (Abort) 0xff31feb0: __lwp_kill+0x0008: bgeu,a __lwp_kill+0x1c ! 0xff31fec4 Current function is ldap_get_option 111 assert( LDAP_VALID( ld ) );
==> backtrace.2009030140 <== t@6 (l@6) terminated by signal SEGV (no mapping at the fault address) 0xff2b467c: strlen+0x0080: ld [%o1], %o2 Current function is lutil_debug 66 vsnprintf( buffer, sizeof(buffer), fmt, vl );
==> backtrace.2009030219 <== t@4 (l@4) terminated by signal ABRT (Abort) 0xff31feb0: __lwp_kill+0x0008: bgeu,a __lwp_kill+0x1c ! 0xff31fec4 Current function is ldap_get_option 111 assert( LDAP_VALID( ld ) );
==> backtrace.2009030221 <== t@7 (l@7) terminated by signal ABRT (Abort) 0xff31feb0: __lwp_kill+0x0008: bgeu,a __lwp_kill+0x1c ! 0xff31fec4 Current function is ldap_pvt_runqueue_resched 165 assert ( e == entry );
==> backtrace.2009031933 <== t@8 (l@8) terminated by signal ABRT (Abort) 0xff31feb0: __lwp_kill+0x0008: bgeu,a __lwp_kill+0x1c ! 0xff31fec4 Current function is syncrepl_updateCookie 3018 assert( !syn->ssyn_validate( syn, si->si_cookieState->cs_vals+i ));
--On Wednesday, March 11, 2009 11:04 AM -0400 Aaron Richton richton@nbcs.rutgers.edu wrote:
On Wed, 11 Mar 2009, Aaron Richton wrote:
https://www.nbcs.rutgers.edu/~richton/test050failures.tar.gz contains 12 bad
See also https://www.nbcs.rutgers.edu/~richton/test050backtraces.tar.gz; match to testrun directories by timestamp. There are six that I focused on that have both testrun AND core available.
Excellent! I had it core one time under linux, but haven't been able to reproduce it. See ITS#6011. So I think with your backtraces we'll be able to track it down.
Also, test050 fails around 1/60 times with 2 masters being out of sync with the other two, so MMR is fairly well busted right now.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Aaron Richton wrote:
test050 fails. Often.
https://www.nbcs.rutgers.edu/~richton/test050failures.tar.gz contains 12 bad testrun directories. If it was an assert() or SEGV, I should have core for a backtrace. I don't know if generating all of them ahead of time would be considered useless spam. Ehh...then again, it's probably not readily apparent from a testrun directory if it was a SEGV or not. I'll see what I can put together; in the interim, if there's a particular run that looks particularly interesting, let me know the timestamp (from the directory name) and I can get that trace out.
I've only briefly looked at your files. I then built fresh on Solaris and am seeing a variety of failures as well. In at least one case I'm seeing a valid update being rejected because the incoming cookie seems to have been confused with another one. This happens when a NEW_COOKIE message is received. I'll note that sending NEW_COOKIE messages is a recent change (ITS#5972), and there is no valid case for them to be occurring in test050. I.e., NEW_COOKIE should be sent in a partial replication situation, where an entry changed in the naming context but it's not within the consumer's scope of interest. In test050, the consumer's scope of interest is the entire naming context. So this at least gives me one area to look for a fix.
I'm also inclined to back out #5972 and its related patches (#5973, #6001) for this release. We were looking for bug fixes and stability, and they've been quite destabilizing.
Howard Chu wrote:
In at least one case I'm seeing a valid update being rejected because the incoming cookie seems to have been confused with another one. This happens when a NEW_COOKIE message is received. I'll note that sending NEW_COOKIE messages is a recent change (ITS#5972), and there is no valid case for them to be occurring in test050. I.e., NEW_COOKIE should be sent in a partial replication situation, where an entry changed in the naming context but it's not within the consumer's scope of interest. In test050, the consumer's scope of interest is the entire naming context. So this at least gives me one area to look for a fix.
This piece of the ITS#5972 patch is part of the problem --- syncprov.c 5 Mar 2009 16:53:01 -0000 1.266 +++ syncprov.c 12 Mar 2009 08:42:54 -0000 @@ -1245,7 +1245,7 @@ } else if ( !saveit && found ) { /* send DELETE */ syncprov_qresp( opc, ss, LDAP_SYNC_DELETE ); - } else if ( !saveit ) { + } else if ( !saveit && !fc.fscope ) { syncprov_qresp( opc, ss, LDAP_SYNC_NEW_COOKIE ); } if ( !saveit && found ) {
My diff above is also not the correct fix, which is why I haven't committed it yet.
The current operation may not have been caught by the previous if conditions for 3 reasons: 1) the change is out of the consumer's scope 2) the change doesn't match the consumer's filter 3) the change is older than the consumer's cookie
The NEW_COOKIE message must only be sent for conditions 1 and 2, but it's currently also being sent for 3. Since the cookie comparison is tacked onto the consumer's filter, an additional comparison is needed to weed this out. (Normally 3 can't be true, but this is MMR where the consumer might have already received this change from some other provider.)
I haven't looked yet, but I suspect there is a corresponding bug in the consumer where it acts on a NEW_COOKIE message whether it's valid or not.
I'm also inclined to back out #5972 and its related patches (#5973, #6001) for this release. We were looking for bug fixes and stability, and they've been quite destabilizing.
Howard Chu wrote:
Howard Chu wrote:
In at least one case I'm seeing a valid update being rejected because the incoming cookie seems to have been confused with another one. This happens when a NEW_COOKIE message is received. I'll note that sending NEW_COOKIE messages is a recent change (ITS#5972), and there is no valid case for them to be occurring in test050. I.e., NEW_COOKIE should be sent in a partial replication situation, where an entry changed in the naming context but it's not within the consumer's scope of interest. In test050, the consumer's scope of interest is the entire naming context. So this at least gives me one area to look for a fix.
This piece of the ITS#5972 patch is part of the problem --- syncprov.c 5 Mar 2009 16:53:01 -0000 1.266 +++ syncprov.c 12 Mar 2009 08:42:54 -0000 @@ -1245,7 +1245,7 @@ } else if ( !saveit&& found ) { /* send DELETE */ syncprov_qresp( opc, ss, LDAP_SYNC_DELETE );
} else if ( !saveit ) {
} else if ( !saveit&& !fc.fscope ) { syncprov_qresp( opc, ss, LDAP_SYNC_NEW_COOKIE ); } if ( !saveit&& found ) {
My diff above is also not the correct fix, which is why I haven't committed it yet.
The current operation may not have been caught by the previous if conditions for 3 reasons:
- the change is out of the consumer's scope
- the change doesn't match the consumer's filter
- the change is older than the consumer's cookie
The NEW_COOKIE message must only be sent for conditions 1 and 2, but it's currently also being sent for 3. Since the cookie comparison is tacked onto the consumer's filter, an additional comparison is needed to weed this out. (Normally 3 can't be true, but this is MMR where the consumer might have already received this change from some other provider.)
I haven't looked yet, but I suspect there is a corresponding bug in the consumer where it acts on a NEW_COOKIE message whether it's valid or not.
The consumer seems to be OK in general - it ignores the NEW_COOKIE if it's equal to or older than its current ctxcsn. However, I've now seen another case where a consumer receives the NEW_COOKIE message from one provider before it has received the real update from any providers. So again, when the real update arrives, it is ignored.
Howard Chu wrote:
The consumer seems to be OK in general - it ignores the NEW_COOKIE if it's equal to or older than its current ctxcsn. However, I've now seen another case where a consumer receives the NEW_COOKIE message from one provider before it has received the real update from any providers. So again, when the real update arrives, it is ignored.
Appears to be the ITS#5973 patch, not 5972.
Howard Chu wrote:
Howard Chu wrote:
The consumer seems to be OK in general - it ignores the NEW_COOKIE if it's equal to or older than its current ctxcsn. However, I've now seen another case where a consumer receives the NEW_COOKIE message from one provider before it has received the real update from any providers. So again, when the real update arrives, it is ignored.
Appears to be the ITS#5973 patch, not 5972.
Reverting the syncrepl part of the ITS#5973 patch appears to let test050 pass for me now. But looking at the logs, there are a lot of NEW_COOKIE messages being sent (and there ought to be none in this scenario). Also, I'm seeing a lot of entries propagated without any CSN in the sync cookie. The results look correct, but there's still something flaky here.
--On Thursday, March 12, 2009 7:05 AM -0700 Howard Chu hyc@symas.com wrote:
Appears to be the ITS#5973 patch, not 5972.
Reverting the syncrepl part of the ITS#5973 patch appears to let test050 pass for me now. But looking at the logs, there are a lot of NEW_COOKIE messages being sent (and there ought to be none in this scenario). Also, I'm seeing a lot of entries propagated without any CSN in the sync cookie. The results look correct, but there's still something flaky here.
test050 still fails for me when looped through multiple iterations. It looks like server2-4 are the same, but server 1 is different:
heldarive 11:43:24 ~/src/openldap/openldap-2-4/qbuild/tests/testrun $ diff -u server1.flt server2.flt --- server1.flt 2009-03-12 10:32:47.081817000 -0700 +++ server2.flt 2009-03-12 10:32:48.163779000 -0700 @@ -416,10 +416,6 @@ objectClass: device cn: Server 3 Test
-dn: cn=To be deleted by server 1,dc=example,dc=com -objectClass: device -cn: To be deleted by server 1 - dn: cn=To be deleted by server 2,dc=example,dc=com objectClass: device cn: To be deleted by server 2
i.e., servers 2-4 never got the delete from server1.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
----- "Quanah Gibson-Mount" quanah@zimbra.com wrote:
--On Thursday, March 12, 2009 7:05 AM -0700 Howard Chu hyc@symas.com
I've tested again on i386 and all passing now.
Thanks.
Howard Chu wrote:
Howard Chu wrote:
In at least one case I'm seeing a valid update being rejected because the incoming cookie seems to have been confused with another one. This happens when a NEW_COOKIE message is received. I'll note that sending NEW_COOKIE messages is a recent change (ITS#5972), and there is no valid case for them to be occurring in test050. I.e., NEW_COOKIE should be sent in a partial replication situation, where an entry changed in the naming context but it's not within the consumer's scope of interest. In test050, the consumer's scope of interest is the entire naming context. So this at least gives me one area to look for a fix.
I agree, in a MMR configuration NEW_COOKIE messages should not have been sent, except possibly when the entire csn set is updated at the end of a refresh phase. But is looks more and more to me as if the fact that test050 do show these messages is a symptom of some entry updates being ignored by syncprov, or not passed to syncprov by syncrepl.
This piece of the ITS#5972 patch is part of the problem --- syncprov.c 5 Mar 2009 16:53:01 -0000 1.266 +++ syncprov.c 12 Mar 2009 08:42:54 -0000 @@ -1245,7 +1245,7 @@ } else if ( !saveit && found ) { /* send DELETE */ syncprov_qresp( opc, ss, LDAP_SYNC_DELETE );
} else if ( !saveit ) {
} else if ( !saveit && !fc.fscope ) { syncprov_qresp( opc, ss, LDAP_SYNC_NEW_COOKIE ); } if ( !saveit && found ) {
My diff above is also not the correct fix, which is why I haven't committed it yet.
The current operation may not have been caught by the previous if conditions for 3 reasons: 1) the change is out of the consumer's scope 2) the change doesn't match the consumer's filter 3) the change is older than the consumer's cookie
The NEW_COOKIE message must only be sent for conditions 1 and 2, but it's currently also being sent for 3. Since the cookie comparison is tacked onto the consumer's filter, an additional comparison is needed to weed this out. (Normally 3 can't be true, but this is MMR where the consumer might have already received this change from some other provider.)
Syncprov generally doesn't know the exact state of its consumers in MMR configurations, since the consumers CSNs could have been updated by one of the other providers. So, the NEW_COOKIE messages should be sent in all three cases, leaving the job of filtering out the too old CSNs to the one that have enough information to do so, namely the consumer.
I haven't looked yet, but I suspect there is a corresponding bug in the consumer where it acts on a NEW_COOKIE message whether it's valid or not.
No, the consumer silently ignores updates to CSN values older (or equal) to the values it already knows about.
I'm also inclined to back out #5972 and its related patches (#5973, #6001) for this release. We were looking for bug fixes and stability, and they've been quite destabilizing.
To me it looks more as the extended test050 have triggered race conditions that already was there, and that especially the syncprov half of ITS#5973 have added to the likelihood that they should be shown.
I have run the current test050 script with the 2.4.15 source (which didn't include these patches), and with RE24 (as of two days ago) without ITS#5973, and have seen the same type of failures there. Also, had the problems been triggered by the consumers receiving NEW_COOKIE messages then I would have expected to see "too old" messages on the consumers when it ignores entries. Instead, I find no trace of the missing entries ever being passed on from the provider. But where the update is lost I haven't found out yet. The problem seem to occur when the server where entries are missing receives its updates from one of the other consumers (i.e, not directly from server1). But whether it is syncrepl on this intermediate server that fails to pass it on to syncprov, or syncprov that looses them, I don't know.
Also, I now have around 30 core files similar to the one in ITS#5999, and I have also had a number of cases where I had to kill -9 a slapd running in a tight unlock, yield, lock loop at the same place in syncprov_op_mod(). These loops have all happened when slapd should be stopping, and the mt structure looks equally invalid as with the seg. fault cases. I have no idea as to whether this has anything to do with the test050 failures or not.
Btw, all of the test050 failures I have seen due to missing replications have taken place immediately after the initial loading of the consumers from server1. This could be a coincident, but I have had enough or them to start wondering...
Rein
Rein Tollevik wrote:
To me it looks more as the extended test050 have triggered race conditions that already was there, and that especially the syncprov half of ITS#5973 have added to the likelihood that they should be shown.
I have run the current test050 script with the 2.4.15 source (which didn't include these patches), and with RE24 (as of two days ago) without ITS#5973, and have seen the same type of failures there. Also, had the problems been triggered by the consumers receiving NEW_COOKIE messages then I would have expected to see "too old" messages on the consumers when it ignores entries. Instead, I find no trace of the missing entries ever being passed on from the provider. But where the update is lost I haven't found out yet. The problem seem to occur when the server where entries are missing receives its updates from one of the other consumers (i.e, not directly from server1). But whether it is syncrepl on this intermediate server that fails to pass it on to syncprov, or syncprov that looses them, I don't know.
Also, I now have around 30 core files similar to the one in ITS#5999, and I have also had a number of cases where I had to kill -9 a slapd running in a tight unlock, yield, lock loop at the same place in syncprov_op_mod(). These loops have all happened when slapd should be stopping, and the mt structure looks equally invalid as with the seg. fault cases. I have no idea as to whether this has anything to do with the test050 failures or not.
Btw, all of the test050 failures I have seen due to missing replications have taken place immediately after the initial loading of the consumers from server1. This could be a coincident, but I have had enough or them to start wondering...
Yes, I've seen the same. My suspicion now is that it's due to an update arriving in the consumer near when it transitions from refresh to persist mode, but I haven't been able to isolate it. I also note that adding a SLEEP1 near the beginning of test050, after the consumers have been started but before the ldapadd to populate the privder, completely eliminated the problem. So there's definitely an issue there that needs to be tracked down.
I've also seen the op_mod spin during shutdown. Unfortunately with the rest of the state already destroyed we can't identify what led to it. Seems we need to run the test a few times without restarting the servers to track that.
Howard Chu wrote:
Yes, I've seen the same. My suspicion now is that it's due to an update arriving in the consumer near when it transitions from refresh to persist mode, but I haven't been able to isolate it. I also note that adding a SLEEP1 near the beginning of test050, after the consumers have been started but before the ldapadd to populate the privder, completely eliminated the problem. So there's definitely an issue there that needs to be tracked down.
OK, finally understand the situation.
Server 3's consumer is talking to Server 2 and has entered persist mode. But Server 2 is still performing a refresh against Server 1. During a refresh, individual entries have no CSN in their sync cookie, because they arrive in indeterminate order. Because there's no cookie CSN, the writes have no CSN queued either. Since there is no queued CSN, they don't get onto the psearch queue. When the refresh phase completes, and Server 2 enters its own persist phase, it receives a CSN for its cookie. Writing this cookie causes the NEW_COOKIE messages to be sent out which causes Server 3 to update its context, even though it's missing some number of entries.
Without the NEW_COOKIE message, the test succeeds because some other provider will eventually supply the missing updates. (I.e., mostly by luck because there are many servers operating at once.)
Howard Chu wrote:
Howard Chu wrote:
Yes, I've seen the same. My suspicion now is that it's due to an update arriving in the consumer near when it transitions from refresh to persist mode, but I haven't been able to isolate it. I also note that adding a SLEEP1 near the beginning of test050, after the consumers have been started but before the ldapadd to populate the privder, completely eliminated the problem. So there's definitely an issue there that needs to be tracked down.
OK, finally understand the situation.
Server 3's consumer is talking to Server 2 and has entered persist mode. But Server 2 is still performing a refresh against Server 1. During a refresh, individual entries have no CSN in their sync cookie, because they arrive in indeterminate order. Because there's no cookie CSN, the writes have no CSN queued either. Since there is no queued CSN, they don't get onto the psearch queue. When the refresh phase completes, and Server 2 enters its own persist phase, it receives a CSN for its cookie. Writing this cookie causes the NEW_COOKIE messages to be sent out which causes Server 3 to update its context, even though it's missing some number of entries.
Without the NEW_COOKIE message, the test succeeds because some other provider will eventually supply the missing updates. (I.e., mostly by luck because there are many servers operating at once.)
HEAD has now passed a couple hundred runs for me on the slow box, as well as my other test machines. I think these patches should be sync'd to RE24.
--On Saturday, March 14, 2009 2:04 AM -0700 Howard Chu hyc@symas.com wrote:
HEAD has now passed a couple hundred runs for me on the slow box, as well as my other test machines. I think these patches should be sync'd to RE24.
HEAD segfaults on me, I'll get the backtrace.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Howard Chu wrote:
HEAD has now passed a couple hundred runs for me on the slow box, as well as my other test machines. I think these patches should be sync'd to RE24.
I highly agree! Looks as if you have nailed it :-) test050 have now passed more than 1000 times with HEAD on the system where it seldom used to pass 50 with RE24.
And the ITS#5999 seg. faults and endless loops during shutdown I used to see have gone as well :-) Nice work!
Rein
--On Sunday, March 15, 2009 11:22 PM +0100 Rein Tollevik rein@OpenLDAP.org wrote:
Howard Chu wrote:
HEAD has now passed a couple hundred runs for me on the slow box, as well as my other test machines. I think these patches should be sync'd to RE24.
I highly agree! Looks as if you have nailed it :-) test050 have now passed more than 1000 times with HEAD on the system where it seldom used to pass 50 with RE24.
And the ITS#5999 seg. faults and endless loops during shutdown I used to see have gone as well :-) Nice work!
As I noted, HEAD still segfaults on me.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
--On Sunday, March 15, 2009 11:22 PM +0100 Rein Tollevik rein@OpenLDAP.org wrote:
Howard Chu wrote:
HEAD has now passed a couple hundred runs for me on the slow box, as well as my other test machines. I think these patches should be sync'd to RE24.
I highly agree! Looks as if you have nailed it :-) test050 have now passed more than 1000 times with HEAD on the system where it seldom used to pass 50 with RE24.
And the ITS#5999 seg. faults and endless loops during shutdown I used to see have gone as well :-) Nice work!
As I noted, HEAD still segfaults on me.
Yes, I saw that. But not the backtrace you should provide. So whether your bug is anywhere related to the problems Howard have fixed or not I can only guess. To me at least, the current HEAD works way better than RE24, and fixes at least one serious replication bug that is in RE24.
Rein
--On Monday, March 16, 2009 12:13 AM +0100 Rein Tollevik rein@OpenLDAP.org wrote:
As I noted, HEAD still segfaults on me.
Yes, I saw that. But not the backtrace you should provide. So whether your bug is anywhere related to the problems Howard have fixed or not I can only guess. To me at least, the current HEAD works way better than RE24, and fixes at least one serious replication bug that is in RE24.
I sent the backtrace and testrun to Howard specifically. ;) And I agree, it seems things are better.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
--On Monday, March 16, 2009 12:13 AM +0100 Rein Tollevik rein@OpenLDAP.org wrote:
Yes, I saw that. But not the backtrace you should provide. So whether your bug is anywhere related to the problems Howard have fixed or not I can only guess. To me at least, the current HEAD works way better than RE24, and fixes at least one serious replication bug that is in RE24.
I sent the backtrace and testrun to Howard specifically. ;) And I agree, it seems things are better.
Hm, time to take your own "keep replies on the list" medicine? ;-)
Rein
--On Monday, March 16, 2009 12:25 AM +0100 Rein Tollevik rein@OpenLDAP.org wrote:
Quanah Gibson-Mount wrote:
--On Monday, March 16, 2009 12:13 AM +0100 Rein Tollevik rein@OpenLDAP.org wrote:
Yes, I saw that. But not the backtrace you should provide. So whether your bug is anywhere related to the problems Howard have fixed or not I can only guess. To me at least, the current HEAD works way better than RE24, and fixes at least one serious replication bug that is in RE24.
I sent the backtrace and testrun to Howard specifically. ;) And I agree, it seems things are better.
Hm, time to take your own "keep replies on the list" medicine? ;-)
Not really, I didn't say I was going to specifically send the reply on the list, and I generally spend lots of time debugging with Howard outside of the scope of the lists. ;)
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
--On Monday, March 16, 2009 12:25 AM +0100 Rein Tollevik rein@OpenLDAP.org wrote:
Quanah Gibson-Mount wrote:
--On Monday, March 16, 2009 12:13 AM +0100 Rein Tollevik rein@OpenLDAP.org wrote:
Yes, I saw that. But not the backtrace you should provide. So whether your bug is anywhere related to the problems Howard have fixed or not I can only guess. To me at least, the current HEAD works way better than RE24, and fixes at least one serious replication bug that is in RE24.
I sent the backtrace and testrun to Howard specifically. ;) And I agree, it seems things are better.
Hm, time to take your own "keep replies on the list" medicine? ;-)
Not really, I didn't say I was going to specifically send the reply on the list, and I generally spend lots of time debugging with Howard outside of the scope of the lists. ;)
And unfortunately I had no time to do any more debugging until now; with St. Patrick's Day this Tuesday I had gigs all weekend. I also see that the test050 run I left overnight eventually crashed, and the symptoms are the same as in Quanah's. So, there's still more to track down.
For reference:
violino:~/OD/hobj/tests/testrun> grep rid=003 !$ grep rid=003 slapd.1.log =>do_syncrepl rid=003 do_syncrepl: rid=003 retrying (9 retries left) =>do_syncrepl rid=003 =>do_syncrep2 rid=003 =>do_syncrepl rid=003 =>do_syncrep2 rid=003 olcSyncrepl: {2}rid=003 provider=ldap://localhost:9013/ binddn="cn=config" bin =>do_syncrepl rid=003 =>do_syncrep2 rid=003 olcSyncrepl: {2}rid=003 provider=ldap://localhost:9013/ binddn="cn=config" bin =>do_syncrepl rid=003 =>do_syncrepl rid=003 =>do_syncrep2 rid=003 do_syncrepl: rid=003 quitting
The odd thing here of course is that it should never jump from '9 retries left' to 'quitting', there should be at least 9 failures / retry messages. Seems like we have a wild memory overwrite occurring.
Howard Chu wrote:
And unfortunately I had no time to do any more debugging until now; with St. Patrick's Day this Tuesday I had gigs all weekend. I also see that the test050 run I left overnight eventually crashed, and the symptoms are the same as in Quanah's. So, there's still more to track down.
Look as if I might have hit the same, see stack trace at the end.
For reference:
violino:~/OD/hobj/tests/testrun> grep rid=003 !$ grep rid=003 slapd.1.log =>do_syncrepl rid=003 do_syncrepl: rid=003 retrying (9 retries left) =>do_syncrepl rid=003 =>do_syncrep2 rid=003 =>do_syncrepl rid=003 =>do_syncrep2 rid=003 olcSyncrepl: {2}rid=003 provider=ldap://localhost:9013/ binddn="cn=config" bin =>do_syncrepl rid=003 =>do_syncrep2 rid=003 olcSyncrepl: {2}rid=003 provider=ldap://localhost:9013/ binddn="cn=config" bin =>do_syncrepl rid=003 =>do_syncrepl rid=003 =>do_syncrep2 rid=003 do_syncrepl: rid=003 quitting
The odd thing here of course is that it should never jump from '9 retries left' to 'quitting', there should be at least 9 failures / retry messages. Seems like we have a wild memory overwrite occurring.
I assume it is quitting due to config update. Looks to me as if syncinfo structures are released while still active.
Rein
(gdb) where #0 0x0000002a968d2540 in strlen () from /lib64/tls/libc.so.6 #1 0x0000002a968a4a0b in vfprintf () from /lib64/tls/libc.so.6 #2 0x0000002a968c4434 in vsnprintf () from /lib64/tls/libc.so.6 #3 0x0000002a958c3181 in lutil_debug (debug=<value optimized out>, level=<value optimized out>, fmt=0x448076c8 "$") at debug.c:66 #4 0x00000000004957d1 in do_syncrepl (ctx=0x44807e90, arg=0x858150) at syncrepl.c:1261 #5 0x0000002a9567e415 in ldap_int_thread_pool_wrapper ( xpool=<value optimized out>) at tpool.c:663 #6 0x0000002a9675310a in start_thread () from /lib64/tls/libpthread.so.0 #7 0x0000002a969288b3 in clone () from /lib64/tls/libc.so.6 #8 0x0000000000000000 in ?? () (gdb) print si $1 = (syncinfo_t *) 0x0 (gdb) print *rtask $2 = {next_sched = {tv_sec = 7598733802573148208, tv_usec = 14422794207978861}, interval = {tv_sec = 384, tv_usec = 64}, tnext = {stqe_next = 0x84bc30}, rnext = {stqe_next = 0x858870}, routine = 0, arg = 0x0, tname = 0x505cc0 "do_syncrepl", tspec = 0x857d94 "rid=004"}
(gdb) thr 8 [Switching to thread 8 (process 23265)]#0 0x0000002a968d2540 in strlen () from /lib64/tls/libc.so.6 (gdb) frame 4 #4 0x00000000004957d1 in do_syncrepl (ctx=0x41801e90, arg=0x858a30) at syncrepl.c:1261 1261 Debug( LDAP_DEBUG_TRACE, "=>do_syncrepl %s\n", si->si_ridtxt, 0, 0 ); (gdb) print si $3 = (syncinfo_t *) 0x20 (gdb) print *rtask $4 = {next_sched = {tv_sec = 7526470944284832317, tv_usec = 7598542775770181185}, interval = {tv_sec = 8751185004989543539, tv_usec = 3683997482740818493}, tnext = {stqe_next = 0x6974202235203030}, rnext = {stqe_next = 0x333d74756f656d}, routine = 0xc0, arg = 0x20, tname = 0x84bc10 "\220\004", tspec = 0x69666e6f43657361 <Address 0x69666e6f43657361 out of bounds>}
(gdb) thr app all where Thread 11 (process 23234): #0 0x0000002a96753ffb in pthread_join () from /lib64/tls/libpthread.so.0 #1 0x0000002a9567ec85 in ldap_pvt_thread_join (thread=1082132976, thread_return=0x0) at thr_posix.c:197 #2 0x000000000042b35b in slapd_daemon () at daemon.c:2697 #3 0x0000000000418917 in main (argc=8, argv=0x7fbffff598) at main.c:948
Thread 10 (process 23249): #0 0x0000002a96928c8c in epoll_wait () from /lib64/tls/libc.so.6 #1 0x000000000042e35a in slapd_daemon_task (ptr=<value optimized out>) at daemon.c:2323 #2 0x0000002a9675310a in start_thread () from /lib64/tls/libpthread.so.0 #3 0x0000002a969288b3 in clone () from /lib64/tls/libc.so.6 #4 0x0000000000000000 in ?? ()
Thread 9 (process 23264): #0 0x0000002a967559ba in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9567ec1e in ldap_pvt_thread_cond_wait (cond=0x7ead04, mutex=0x0) at thr_posix.c:277 #2 0x0000002a9567e3e4 in ldap_int_thread_pool_wrapper ( xpool=<value optimized out>) at tpool.c:654 #3 0x0000002a9675310a in start_thread () from /lib64/tls/libpthread.so.0 #4 0x0000002a969288b3 in clone () from /lib64/tls/libc.so.6 #5 0x0000000000000000 in ?? ()
Thread 8 (process 23265): #0 0x0000002a968d2540 in strlen () from /lib64/tls/libc.so.6 #1 0x0000002a968a4a0b in vfprintf () from /lib64/tls/libc.so.6 #2 0x0000002a968c4434 in vsnprintf () from /lib64/tls/libc.so.6 #3 0x0000002a958c3181 in lutil_debug (debug=<value optimized out>, level=<value optimized out>, fmt=0x418016c8 "D") at debug.c:66 #4 0x00000000004957d1 in do_syncrepl (ctx=0x41801e90, arg=0x858a30) at syncrepl.c:1261 #5 0x0000002a9567e415 in ldap_int_thread_pool_wrapper ( xpool=<value optimized out>) at tpool.c:663 #6 0x0000002a9675310a in start_thread () from /lib64/tls/libpthread.so.0 #7 0x0000002a969288b3 in clone () from /lib64/tls/libc.so.6 #8 0x0000000000000000 in ?? ()
Thread 7 (process 23290): #0 0x0000002a96757e0b in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0 #1 0x0000000000000000 in ?? ()
Thread 6 (process 23291): #0 0x0000002a9691b33f in write () from /lib64/tls/libc.so.6 #1 0x0000002a968c7328 in _IO_new_file_write () from /lib64/tls/libc.so.6 #2 0x0000002a968c7517 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6 #3 0x0000002a968bd013 in fputs () from /lib64/tls/libc.so.6 #4 0x0000002a958c309a in ber_error_print ( data=0x2 <Address 0x2 out of bounds>) at bprint.c:79 #5 0x0000002a9569af4e in ldap_log_printf (ld=<value optimized out>, loglvl=<value optimized out>, fmt=<value optimized out>) at print.c:60 #6 0x0000002a956931bd in ldap_free_connection (ld=0x868ed0, lc=0x2a9ae09b00, force=1, unbind=1) at request.c:777 #7 0x0000002a9568a47d in ldap_ld_free (ld=0x868ed0, close=1, sctrls=<value optimized out>, cctrls=<value optimized out>) at unbind.c:93 #8 0x0000002a9568a768 in ldap_unbind_ext (ld=0x868ed0, sctrls=0x0, cctrls=0x0) at unbind.c:52 #9 0x000000000048dd90 in syncinfo_free (sie=0x858a90, free_all=1) ---Type <return> to continue, or q <return> to quit--- at syncrepl.c:3613 #10 0x00000000004963f2 in do_syncrepl (ctx=<value optimized out>, arg=0x858ea0) at syncrepl.c:1468 #11 0x0000002a9567e415 in ldap_int_thread_pool_wrapper ( xpool=<value optimized out>) at tpool.c:663 #12 0x0000002a9675310a in start_thread () from /lib64/tls/libpthread.so.0 #13 0x0000002a969288b3 in clone () from /lib64/tls/libc.so.6 #14 0x0000000000000000 in ?? ()
Thread 5 (process 23441): #0 0x0000002a96757f1d in __lll_mutex_unlock_wake () from /lib64/tls/libpthread.so.0 #1 0x00000000007eacd8 in ?? () #2 0x0000000000000002 in ?? () #3 0x0000002a96754f38 in ?? () from /lib64/tls/libpthread.so.0 #4 0x0000000000000000 in ?? ()
Thread 4 (process 23442): #0 0x0000002a96757e0b in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0 #1 0x0000000000000000 in ?? ()
Thread 3 (process 23456): #0 0x0000002a96757e0b in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0 #1 0x0000000000000000 in ?? ()
Thread 2 (process 23458): #0 ldap_int_thread_pool_wrapper (xpool=0x7eacd0) at tpool.c:603 #1 0x0000002a9675310a in start_thread () from /lib64/tls/libpthread.so.0 #2 0x0000002a969288b3 in clone () from /lib64/tls/libc.so.6 #3 0x0000000000000000 in ?? ()
Thread 1 (process 23457): #0 0x0000002a968d2540 in strlen () from /lib64/tls/libc.so.6 #1 0x0000002a968a4a0b in vfprintf () from /lib64/tls/libc.so.6 #2 0x0000002a968c4434 in vsnprintf () from /lib64/tls/libc.so.6 #3 0x0000002a958c3181 in lutil_debug (debug=<value optimized out>, level=<value optimized out>, fmt=0x448076c8 "$") at debug.c:66 #4 0x00000000004957d1 in do_syncrepl (ctx=0x44807e90, arg=0x858150) at syncrepl.c:1261 #5 0x0000002a9567e415 in ldap_int_thread_pool_wrapper ( xpool=<value optimized out>) at tpool.c:663 #6 0x0000002a9675310a in start_thread () from /lib64/tls/libpthread.so.0 #7 0x0000002a969288b3 in clone () from /lib64/tls/libc.so.6 #8 0x0000000000000000 in ?? () (gdb)
Rein Tollevik wrote:
Howard Chu wrote:
And unfortunately I had no time to do any more debugging until now; with St. Patrick's Day this Tuesday I had gigs all weekend. I also see that the test050 run I left overnight eventually crashed, and the symptoms are the same as in Quanah's. So, there's still more to track down.
Look as if I might have hit the same, see stack trace at the end.
For reference:
violino:~/OD/hobj/tests/testrun> grep rid=003 !$ grep rid=003 slapd.1.log =>do_syncrepl rid=003 do_syncrepl: rid=003 retrying (9 retries left) =>do_syncrepl rid=003 =>do_syncrep2 rid=003 =>do_syncrepl rid=003 =>do_syncrep2 rid=003 olcSyncrepl: {2}rid=003 provider=ldap://localhost:9013/ binddn="cn=config" bin =>do_syncrepl rid=003 =>do_syncrep2 rid=003 olcSyncrepl: {2}rid=003 provider=ldap://localhost:9013/ binddn="cn=config" bin =>do_syncrepl rid=003 =>do_syncrepl rid=003 =>do_syncrep2 rid=003 do_syncrepl: rid=003 quitting
The odd thing here of course is that it should never jump from '9 retries left' to 'quitting', there should be at least 9 failures / retry messages. Seems like we have a wild memory overwrite occurring.
I assume it is quitting due to config update. Looks to me as if syncinfo structures are released while still active.
OK. This must be occurring because a connection_client thread is in the thread pool but hasn't started running yet when the config change occurs. So the usual mutexes aren't held yet...
Rein
(gdb) where #0 0x0000002a968d2540 in strlen () from /lib64/tls/libc.so.6 #1 0x0000002a968a4a0b in vfprintf () from /lib64/tls/libc.so.6 #2 0x0000002a968c4434 in vsnprintf () from /lib64/tls/libc.so.6 #3 0x0000002a958c3181 in lutil_debug (debug=<value optimized out>, level=<value optimized out>, fmt=0x448076c8 "$") at debug.c:66 #4 0x00000000004957d1 in do_syncrepl (ctx=0x44807e90, arg=0x858150) at syncrepl.c:1261 #5 0x0000002a9567e415 in ldap_int_thread_pool_wrapper ( xpool=<value optimized out>) at tpool.c:663 #6 0x0000002a9675310a in start_thread () from /lib64/tls/libpthread.so.0 #7 0x0000002a969288b3 in clone () from /lib64/tls/libc.so.6 #8 0x0000000000000000 in ?? () (gdb) print si $1 = (syncinfo_t *) 0x0 (gdb) print *rtask $2 = {next_sched = {tv_sec = 7598733802573148208, tv_usec = 14422794207978861}, interval = {tv_sec = 384, tv_usec = 64}, tnext = {stqe_next = 0x84bc30}, rnext = {stqe_next = 0x858870}, routine = 0, arg = 0x0, tname = 0x505cc0 "do_syncrepl", tspec = 0x857d94 "rid=004"}
(gdb) thr 8 [Switching to thread 8 (process 23265)]#0 0x0000002a968d2540 in strlen () from /lib64/tls/libc.so.6 (gdb) frame 4 #4 0x00000000004957d1 in do_syncrepl (ctx=0x41801e90, arg=0x858a30) at syncrepl.c:1261 1261 Debug( LDAP_DEBUG_TRACE, "=>do_syncrepl %s\n", si->si_ridtxt, 0, 0 ); (gdb) print si $3 = (syncinfo_t *) 0x20 (gdb) print *rtask $4 = {next_sched = {tv_sec = 7526470944284832317, tv_usec = 7598542775770181185}, interval = {tv_sec = 8751185004989543539, tv_usec = 3683997482740818493}, tnext = {stqe_next = 0x6974202235203030}, rnext = {stqe_next = 0x333d74756f656d}, routine = 0xc0, arg = 0x20, tname = 0x84bc10 "\220\004", tspec = 0x69666e6f43657361<Address 0x69666e6f43657361 out of bounds>}
--On Monday, March 16, 2009 1:53 PM -0700 Howard Chu hyc@symas.com wrote:
OK. This must be occurring because a connection_client thread is in the thread pool but hasn't started running yet when the config change occurs. So the usual mutexes aren't held yet...
With current HEAD, I have over 500 passes of test050. I think it's finally good to go, and RE24 has been sync'd. Previously, I never made it past 100 iterations of test050. ;)
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
--On Tuesday, March 17, 2009 11:13 AM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
--On Monday, March 16, 2009 1:53 PM -0700 Howard Chu hyc@symas.com wrote:
OK. This must be occurring because a connection_client thread is in the thread pool but hasn't started running yet when the config change occurs. So the usual mutexes aren't held yet...
With current HEAD, I have over 500 passes of test050. I think it's finally good to go, and RE24 has been sync'd. Previously, I never made it past 100 iterations of test050. ;)
(this particular fix is good, that is). There's still a few outstanding issues needing investigating before we sign off on 2.4.16.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
----- "Quanah Gibson-Mount" quanah@zimbra.com wrote:
Please test RE24 as it is being considered for 2.4.16. Thanks!
test045 is failing for me on i386. Will clean out and re-run.
Thanks.