I'm working with a test mirror mode setup on 2.4.7 with BDB 4.5.20 and seeing issues with SyncRepl. Specifically, do_syncrepl fails with an initial error 0x10 and a subsequent 0x14, though at least one modification is propagated. To put it another way:
* Both systems are populated from the same LDIF and run nearly identical slapd.conf files (serverID is the only variance).
1) start server 1
2) start server 2
3) add a host attribute to a posixAccount entry on s-1 (see the sketch below)
4) attribute is seen on s-2, but results in the following log
5) no other updates succeed until the server process is restarted.
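For reference, the modification in step 3 is of this general shape (a sketch only; the host value and server hostname are placeholders, while the entry DN and rootdn are the ones that appear in the logs and config below):

ldapmodify -x -H ldap://server1.example.com \
    -D "cn=Manager,dc=example,dc=com" -W <<EOF
dn: uid=tguy,ou=employees,ou=people,dc=example,dc=com
changetype: modify
add: host
host: somebox.example.com
EOF

The log from step 4 follows.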
Mar 4 10:46:14 slapd[22999]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Mar 4 10:46:14 slapd[22999]: do_syncrep2: cookie=rid=001,sid=002,csn=20080304164630.523962Z#000000#001#000000
Mar 4 10:46:19 slapd[22999]: conn=0 fd=25 ACCEPT from IP=172.30.24.5:53140 (IP=0.0.0.0:389)
Mar 4 10:46:19 slapd[22999]: conn=0 op=0 BIND dn="uid=syncrepl,ou=ldap,dc=example,dc=com" method=128
Mar 4 10:46:19 slapd[22999]: conn=0 op=0 BIND dn="uid=syncrepl,ou=ldap,dc=example,dc=com" mech=SIMPLE ssf=0
Mar 4 10:46:19 slapd[22999]: conn=0 op=0 RESULT tag=97 err=0 text=
Mar 4 10:46:19 slapd[22999]: conn=0 op=1 SRCH base="cn=accesslog" scope=2 deref=0 filter="(reqResult=0)"
Mar 4 10:46:19 slapd[22999]: conn=0 op=1 SRCH attr=reqDN reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN
Mar 4 10:46:45 slapd[22999]: conn=1 fd=26 ACCEPT from IP=172.16.71.134:65251 (IP=0.0.0.0:389)
Mar 4 10:46:45 slapd[22999]: conn=1 op=0 BIND dn="cn=manager,dc=example,dc=com" method=128
Mar 4 10:46:45 slapd[22999]: conn=1 op=0 BIND dn="cn=manager,dc=example,dc=com" mech=SIMPLE ssf=0
Mar 4 10:46:45 slapd[22999]: conn=1 op=0 RESULT tag=97 err=0 text=
Mar 4 10:46:45 slapd[22999]: conn=1 op=1 SRCH base="cn=accesslog" scope=2 deref=0 filter="(objectClass=*)"
Mar 4 10:46:45 slapd[22999]: conn=1 op=1 SEARCH RESULT tag=101 err=0 nentries=2 text=
Mar 4 10:46:45 slapd[22999]: conn=1 op=2 UNBIND
Mar 4 10:46:45 slapd[22999]: conn=1 fd=26 closed
Mar 4 10:46:45 slapd[22999]: conn=2 fd=26 ACCEPT from IP=192.168.15.32:45137 (IP=0.0.0.0:389)
Mar 4 10:46:45 slapd[22999]: conn=2 fd=26 closed (connection lost)
Mar 4 10:47:07 slapd[22999]: do_syncrep2: cookie=rid=001,sid=002,csn=20080304164728.385080Z#000000#001#000000
Mar 4 10:47:07 slapd[22999]: syncrepl_message_to_op: rid=001 be_modify uid=tguy,ou=employees,ou=people,dc=example,dc=com (0)
Mar 4 10:47:07 slapd[22999]: null_callback : error code 0x10
Mar 4 10:47:07 slapd[22999]: syncrepl_updateCookie: rid=001 be_modify failed (16)
Mar 4 10:47:07 slapd[22999]: do_syncrepl: rid=001 retrying
Mar 4 10:47:17 slapd[22999]: null_callback : error code 0x14
Mar 4 10:47:17 slapd[22999]: syncrepl_message_to_op: rid=001 be_modify uid=tguy,ou=employees,ou=people,dc=example,dc=com (20)

**** LAST 3 LINES REPEATED THEREAFTER ****
A little more verbose output on the consumer side shows some problem modifying the underlying DB, but I don't see an obvious place to go from here.
Mar 4 10:57:10 slapd[23057]: =>do_syncrepl rid=001
Mar 4 10:57:10 slapd[23057]: =>do_syncrep2 rid=001
Mar 4 10:57:10 slapd[23057]: >>> dnPrettyNormal: <uid=tguy,ou=employees,ou=people,dc=example,dc=com>
Mar 4 10:57:10 slapd[23057]: <<< dnPrettyNormal: <uid=tguy,ou=employees,ou=people,dc=example,dc=com>, <uid=tguy,ou=employees,ou=people,dc=example,dc=com>
Mar 4 10:57:10 slapd[23057]: >>> dnPretty: <cn=Manager,dc=example,dc=com>
Mar 4 10:57:10 slapd[23057]: <<< dnPretty: <cn=Manager,dc=example,dc=com>
Mar 4 10:57:10 slapd[23057]: >>> dnNormalize: <cn=Manager,dc=example,dc=com>
Mar 4 10:57:10 slapd[23057]: <<< dnNormalize: <cn=manager,dc=example,dc=com>
Mar 4 10:57:10 slapd[23057]: => bdb_entry_get: ndn: "uid=tguy,ou=employees,ou=people,dc=example,dc=com"
Mar 4 10:57:10 slapd[23057]: => bdb_entry_get: oc: "(null)", at: "(null)"
Mar 4 10:57:10 slapd[23057]: bdb_dn2entry("uid=tguy,ou=employees,ou=people,dc=example,dc=com")
Mar 4 10:57:10 slapd[23057]: => bdb_entry_get: found entry: "uid=tguy,ou=employees,ou=people,dc=example,dc=com"
Mar 4 10:57:10 slapd[23057]: bdb_entry_get: rc=0
Mar 4 10:57:10 slapd[23057]: hdb_modify: uid=tguy,ou=employees,ou=people,dc=example,dc=com
Mar 4 10:57:10 slapd[23057]: bdb_dn2entry("uid=tguy,ou=employees,ou=people,dc=example,dc=com")
Mar 4 10:57:10 slapd[23057]: bdb_modify_internal: 0x00000a2b: uid=tguy,ou=employees,ou=people,dc=example,dc=com
Mar 4 10:57:10 slapd[23057]: <= acl_access_allowed: granted to database root
Mar 4 10:57:10 slapd[23057]: bdb_modify_internal: delete host
Mar 4 10:57:10 slapd[23057]: bdb_modify_internal: 16 modify/delete: host: no such value
Mar 4 10:57:10 slapd[23057]: hdb_modify: modify failed (16)
Mar 4 10:57:10 slapd[23057]: send_ldap_result: conn=-1 op=0 p=0
Mar 4 10:57:10 slapd[23057]: send_ldap_result: err=16 matched="" text="modify/delete: host: no such value"
Mar 4 10:57:10 slapd[23057]: null_callback : error code 0x10
Mar 4 10:57:10 slapd[23057]: syncrepl_message_to_op: rid=001 be_modify uid=tguy,ou=employees,ou=people,dc=example,dc=com (16)
Again, if multiple successful writes are made to the 'primary' server in the mirror, only the first makes it to the 'secondary.' If the second server is restarted, it retrieves all new edits as expected.
Can someone elaborate on what bdb_modify_internal is trying to tell me?
Please let me know if more information is required.
Thanks.
Slapd.conf
# Global Settings #################
include         /etc/openldap/schema/core.schema
include         /etc/openldap/schema/cosine.schema
include         /etc/openldap/schema/nis.schema
include         /etc/openldap/schema/inetorgperson.schema
include         /etc/openldap/schema/sendmail.schema
include         /etc/openldap/schema/solaris.schema
include         /etc/openldap/schema/example.schema
pidfile         /var/run/openldap/slapd.pid
argsfile        /var/run/openldap/slapd.args
sizelimit       unlimited
# Loglevel ##########
loglevel any
moduleload      /usr/local/libexec/openldap/back_hdb.la
moduleload      /usr/local/libexec/openldap/accesslog.la
# TLS/SSL #########
TLSCipherSuite          HIGH:MEDIUM:+SSLv2
TLSCertificateFile      /etc/openldap/certs/ldap.example.com.crt
TLSCertificateKeyFile   /etc/openldap/certs/ldap.example.com.key
TLSCACertificateFile    /etc/openldap/certs/ca.crt
TLSVerifyClient         never
# Access log ############
database        bdb
suffix          cn=accesslog
directory       /var/lib/openldap-data/accesslog
rootdn          cn=Manager,cn=accesslog
rootpw          secret
index           default eq
index           reqStart,objectClass,entryCSN,reqResult eq
cachesize       3000
idlcachesize    9000
overlay syncprov
syncprov-nopresent TRUE
syncprov-reloadhint TRUE
# Primary DB Configuration ##########################
database        hdb
suffix          "dc=example,dc=com"
rootdn          "cn=Manager,dc=example,dc=com"
rootpw          secret
directory       /var/lib/openldap-data/example.com
cachesize       3000
idlcachesize    9000
# Primary DB Indexes ####################
index   objectClass,entryUUID,entryCSN  eq
index   cn,sn,uid                       eq,sub,pres,approx
index   sendmailMTAMapName              eq,pres
index   sendmailMTAMapValue             eq
index   sendmailMTAKey                  eq,sub,pres
index   sendmailMTAAliasValue           eq,pres
index   sendmailMTAClassName            eq
index   sendmailMTAClassValue           eq
index   exampleUserModifiable           eq
# Primary DB SyncRepl #####################
syncrepl rid=1
         provider=ldap://ldap.example.com
         type=refreshAndPersist
         searchbase="dc=example,dc=com"
         bindmethod=simple
         binddn="uid=syncrepl,ou=ldap,dc=example,dc=com"
         credentials="syncpw"
         retry="10 +"
         schemachecking=off
         logbase="cn=accesslog"
         syncdata=accesslog
         logfilter="(reqResult=0)"
mirrormode on
serverID 2
# Primary DB Sync Provider ##########################
overlay syncprov
syncprov-checkpoint 10 1
syncprov-sessionlog 100
# Primary DB accesslog ######################
overlay accesslog
logdb cn=accesslog
logops writes
logsuccess TRUE
logpurge 07+00:00 01+00:00
# ACLs for Primary DB #####################
access to attrs=userPassword
        by anonymous auth
        by self write
        by dn="uid=syncrepl,ou=ldap,dc=example,dc=com" read
        by dn="uid=pam,ou=ldap,dc=example,dc=com" read
        by * none

access to dn.subtree="ou=employees,ou=people,dc=example,dc=com"
        by anonymous auth
        by self write
        by dn.regex="uid=[^,]+,ou=employees,ou=people,dc=example,dc=com" read
        by dn="uid=syncrepl,ou=ldap,dc=example,dc=com" read
        by dn="uid=pam,ou=ldap,dc=example,dc=com" read
        by * none

access to dn.subtree="ou=group,dc=example,dc=com"
        by dn.regex="uid=[^,]+,ou=employees,ou=people,dc=example,dc=com" read
        by dn="uid=syncrepl,ou=ldap,dc=example,dc=com" read
        by dn="uid=pam,ou=ldap,dc=example,dc=com" read

access to dn.subtree="ou=ldap,dc=example,dc=com"
        by anonymous auth
        by dn="uid=syncrepl,ou=ldap,dc=example,dc=com" read
        by dn="uid=pam,ou=ldap,dc=example,dc=com" read

access to dn="ou=sendmail,dc=example,dc=com"
        by ssf=112 dn.base="uid=sendmail,ou=ldap,dc=example,dc=com" read
        by dn.regex="uid=[^,]+,ou=employees,ou=people,dc=example,dc=com" read
        by dn="uid=syncrepl,ou=ldap,dc=example,dc=com" read

access to dn.subtree="dc=example,dc=com"
        by dn="uid=syncrepl,ou=ldap,dc=example,dc=com" read
        by dn="uid=pam,ou=ldap,dc=example,dc=com" read
        by dn.regex="uid=[^,]+,ou=employees,ou=people,dc=example,dc=com" read
access to dn.base="" by * read
Do you have different behavior if you make the update to ldap server 2 and it tries to replicate to ldap server 1?
Sellers
______________________________________________
Chris G. Sellers | NITLE - Technology Team
734.661.2318 | chris.sellers@nitle.org
AIM: imthewherd | GoogleTalk: cgseller@gmail.com
Thanks for the reply, Chris. That hadn't occurred to me. It seems that yes, when starting server 2 first and writing to it, replication appears to work fine. That got me thinking perhaps there was an issue with the build or environment on the first server, but I can't find any meaningful discrepancy between the systems.
Something I'm curious about, though, is the behavior when the serverID is set incorrectly. It's my understanding that in mirror mode the pair are configured to use the same RID with different SIDs. I have been working with RID 1 and SIDs 1 and 2 for server 1 and 2 respectively. An earlier test where I had neglected to change the SID (rid 1 with two sid 1s) deluged my logs with what looked to be a loop: gigs of do_syncrep2 'Content Sync Refresh Required' messages and subsequent attempts to sync.
I understand that the config is the source of the problem there. However, given the success of this morning's test with server 2 as the 'primary,' I wondered if there was some confusion when using rid 1 and sid 1, so I changed server 1 to sid 3. What I don't understand is that I see the same log entries/behavior with rid 1 and sids 3 and 2 as I did when both systems were set to sid 1. Can anyone explain that behavior? And of course, Chris, what was your hunch in suspecting that things might work from the second server?
Any input is very much appreciated as my project cannot continue without a functional mirror. Thanks!
-Nick
Nick,
My hunch was that since you are in N-Way Multi-Master mode, it should work both ways, and I think that nailed down where the problem is.
You need two syncrepl sections: one for the replication from A to B, and one for the replication from B to A. Both servers should have both RIDs.
See the MirrorMode and N-Way sections of the Admin24 documentation. You will see that you need both servers to have both RID lines; just be sure the serverID is unique.
http://www.openldap.org/doc/admin24/replication.html#N-Way%20Multi-Master
16.5.3.1 is a good reference.
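In slapd.conf terms, the shape on each mirror host is roughly the following (a sketch only, modeled on that section and on the config later in this thread; the provider hostnames are placeholders and the bind settings are just the ones already used here). serverID is a global directive and is the only value that differs between the two hosts; the syncrepl statements and mirrormode belong with the database section:

# global section: serverID is 1 here and 2 on the other mirror
serverID    1

# database section: identical on both servers
syncrepl    rid=001
            provider=ldap://ldap1.example.com
            type=refreshAndPersist
            searchbase="dc=example,dc=com"
            bindmethod=simple
            binddn="uid=syncrepl,ou=ldap,dc=example,dc=com"
            credentials=secret
            retry="10 +"

syncrepl    rid=002
            provider=ldap://ldap2.example.com
            type=refreshAndPersist
            searchbase="dc=example,dc=com"
            bindmethod=simple
            binddn="uid=syncrepl,ou=ldap,dc=example,dc=com"
            credentials=secret
            retry="10 +"

mirrormode  on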
Enjoy
Sellers (p.s. I'm not sure why the first change replicated though - if I understand correctly, it would have all failed in the one direction)
Chris,
Well, it looks like I should have read that doc a little closer. ;) Still, though I now have two syncrepl blocks copied directly from that doc, I'm still seeing odd behavior that I hope this list can help clear up.
The gist of it is that I see replication to both, but only if I write to one. I hope I'm understanding things correctly when I assume that in the multi-master mirror each system should be able to take writes, though OpenLDAP cannot handle reconciling concurrent writes to each for the same entry (hence the warning about writing to one at a time in the admin doc). If I'm wrong here, well, I can't see any obvious flaw in my logic.
Just for reference, I'm trying to set up a mirror behind a pair of load balancers; syncrepl should take place on the back-end, though. Based on my interpretation, I want to weight one real server so it is in effect the 'master' in a hot-standby configuration. I wanted a mirror to mitigate fail-back issues.
S1 ----\    *** 'Primary' server via weighted LB interface.
        | LB(s)---pub
S2-----/
A simple run down of one of my tests goes as follows:
start server 1
start server 2 and verify in logs attempts to do_syncrepl
modify entry on one system and verify synchronization to the other
This is what I see:
start server 1 then 2 and modify 1 - this works
start server 2 then 1 and modify 1 - this works
start server 1 then 2 and modify 2 - this fails*
start server 2 then 1 and modify 2 - this fails
* By fails, I mean that attributes on the modified entry are not in sync, though there are no 'retry' attempts as if both believe they have the most current contextCSN. Again, only the first (of three) modifications is propagated.
Another thing to note, I only see this behavior if I do not use the accesslog overlay. Ergo:
syncrepl rid=001
         provider=ldap://server01-ldap.mgt.example.com
         bindmethod=simple
         binddn="uid=syncrepl,ou=ldap,dc=example,dc=com"
         credentials=secret
         searchbase="dc=example,dc=com"
         schemachecking=on
         type=refreshAndPersist
         retry="10 +"

syncrepl rid=002
         provider=ldap://server02-ldap.mgt.example.com
         bindmethod=simple
         binddn="uid=syncrepl,ou=ldap,dc=example,dc=com"
         credentials=secret
         searchbase="dc=example,dc=com"
         schemachecking=on
         type=refreshAndPersist
         retry="10 +"

mirrormode on
serverID 1
If I add logbase and logfilter (the delta-syncrepl parameters), I see the behavior where any serverID other than 1 (on either system) results in the looping mentioned previously when one server is down. In that setup, with both servers up, I still see the null_callback issues.
At this point I'm pretty frayed. I don't see how copying something directly out of the admin doc would fail unless I've tickled a new bug.
Perhaps someone could suggest a known good, working config for me to try? I would prefer delta-syncrepl in a mirror, but I'd settle for just syncrepl with a mirror at this point. I started this project with the intention of updating our older slurpd based ldap server 'cluster.'
Again, thanks for any help.
-Nick
Nick Geron wrote:
Perhaps someone could suggest a known good, working config for me to try? I would prefer delta-syncrepl in a mirror, but I'd settle for just syncrepl with a mirror at this point.
Delta-syncrepl is not supported yet with mirror mode. Support may come in a later release.
Howard Chu wrote:
Delta-syncrepl is not supported yet with mirror mode. Support may come in a later release.
Thanks for the info. Someone might want to update the admin doc. From section 16.5.3.1 - Mirror Mode Configuration - "This is the same as the Set up the provider slapd section, reference delta-syncrepl replication if using delta-syncrepl."
We're now thinking some of our issues may be attributable to time granularity issues. We're seeing missing information on the consumer if multiple successive writes are attempted via a script. If we slow down to human speed or insert sleeps in our test code, this gets a little better. I see that A.2.4 N-Way MultiMaster Replication notes that entryCSNs now record with microseconds, but does this apply to mirrors as well? Can I setup a two node N-Way? Can anyone provide config examples using slapd.conf syntax instead of the cn=config in the admin doc?
Also, though the Syncrepl Details section does provide what it says, could anyone offer a high-level overview of the synchronization process? Something like a trace of operations from an initial modification on one server through the propagation to consumers (server 1 gets write request -> updates index -> writes to session log, etc.).
-Nick Geron
Nick Geron wrote:
We're now thinking some of our issues may be attributable to time granularity issues. We're seeing missing information on the consumer if multiple successive writes are attempted via a script. If we slow down to human speed or insert sleeps in our test code, this gets a little better. I see that A.2.4 N-Way MultiMaster Replication notes that entryCSNs now record with microseconds, but does this apply to mirrors as well?
CSNs were extended to microsecond resolution only for the benefit of conflict resolution. For all other purposes, the changecount field ensures sufficient granularity.
Can I setup a two node N-Way?
"2" is certainly a valid value of "N".
Can anyone provide config examples using slapd.conf syntax instead of the cn=config in the admin doc?
Also, though the Syncrepl Details section does provide what it says, could anyone offer a high level overview of the synchonization process? Something like a trace of operations from an initial modification on one server through the propagation to consumers (server 1 gets write request -> updates index -> writes to session log, etc. etc.etc.).
Syncrepl doesn't write session logs. Read RFC4533.
Howard Chu wrote:
CSNs were extended to microsecond resolution only for the benefit of conflict resolution. For all other purposes, the changecount field ensures sufficient granularity.
In that case, why do we see any difference in propagation between scripted (quick) updates and hand/command line (slow) modifications? Or are you simply saying time is not the issue?
For example - manipulating one particular entry:
1) update server 1 adding 1 attribute = propagates to second server
   * wait a few seconds
2) update server 1 adding 4 attributes = first of four propagates to second server
After waiting a second or so, another successful operation on the 'write' server will propagate all modifications over to the second server as expected. This behavior is why we suspected a time granularity issue. It should be noted that this doesn't work for us (and others as I would expect) as there is no guarantee that another operation on the 'write' server will occur, thereby propagating the current entry.
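The scripted case is essentially equivalent to this (a rough sketch; the host values are placeholders, while the hostnames, DN, and credentials are the ones from the config earlier in the thread):

# several modifies in quick succession against the 'write' server
for v in one two three four; do
  ldapmodify -x -H ldap://server01-ldap.mgt.example.com \
    -D "cn=Manager,dc=example,dc=com" -w secret <<EOF
dn: uid=tguy,ou=employees,ou=people,dc=example,dc=com
changetype: modify
add: host
host: $v.example.com
EOF
done

sleep 2

# see which of the values actually arrived on the other mirror
ldapsearch -x -H ldap://server02-ldap.mgt.example.com \
  -D "cn=Manager,dc=example,dc=com" -w secret \
  -b "uid=tguy,ou=employees,ou=people,dc=example,dc=com" -s base host

In a run like this, only the first added value shows up on the second server until some later write comes along.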
Can I setup a two node N-Way?
"2" is certainly a valid value of "N".
Well, there's that developer 'charm' I've been reading throughout years of archives. Since the admin doc makes a distinction between the 'hybrid configuration' of MirrorMode and N-Way Multi-Master, I was more looking for clarification between the two implementations. Not a jackass comment.
Syncrepl doesn't write session logs. Read RFC4533.
I'll look into it. Thanks.
Switching gears, what would the devs say 2.4.7 is capable of in operations per second? I'm seeing a number of aborts when testing under high load. The latest came from running scripted ldapsearches and ldapmodifies, which resulted in a mutex error (or so I am told by one of our developers).
Specifically:
1) adding about 100 attributes to an entry
2) diffing the output of ldapsearch between the two nodes in a loop
3) once synced, grabbing the attributes, shoving them in a temp file with delete instructions and using that with ldapmodify.
I compiled with debugging on, which results in an abort with "connection.c: 676: connection_state_closing: Assertion 'c_struct_state == 0x02' failed" logged.
When tracing is employed (using strace) I have yet to see this behavior.
Nick Geron wrote:
In that case, why do we see any difference in propagation between scripted (quick) updates and hand/command line (slow) modifications? Or are you simply saying time is not the issue?
Timestamps are not the issue for propagation.
For example - manipulating one particular entry:
- update server 1 adding 1 attribute = propagates to second server
- wait a few seconds
- update server 1 adding 4 attributes = first of four propagates to
second server
After waiting a second or so, another successful operation on the 'write' server will propagate all modifications over to the second server as expected. This behavior is why we suspected a time granularity issue. It should be noted that this doesn't work for us (and others as I would expect) as there is no guarantee that another operation on the 'write' server will occur, thereby propagating the current entry.
OK, this sounds like the background thread to propagate updates isn't getting scheduled when it should. That could be a bug in the syncprov overlay.
Can I setup a two node N-Way?
"2" is certainly a valid value of "N".
Well, there's that developer 'charm' I've been reading throughout years of archives. Since the admin doc make a distinction between the 'hybrid configuration' of MirrorMode and N-Way Multi-Master, I was more looking for clarification between the two implementations.
Then that is what you should have asked. "Looking for clarification between the implementation of MirrorMode and Multi-Master" is a much clearer question than "Can I setup a two node N-Way", and there is no way one could logically get from the latter to the former, based on the context of your email. If you don't ask useful questions, you have only yourself to blame when you don't get useful answers.
There is no difference now between the MirrorMode and Multi-Master code. The only difference is purely a matter of usage. In a MirrorMode setup you use an external frontend that guarantees that writes are only directed to one server. As long as that guarantee is kept, your servers will have perfect data consistency. In a Multi-Master setup, you allow writes to any server, and the data consistency is not guaranteed. In that case the CSNs are used for conflict resolution; when competing writes are made to the same entries the last writer wins. (Note - the servers will all eventually converge on a consistent view of the data, the issue is that the resulting data may not resemble what you expected. If your servers' clocks are not tightly synchronized, it's pretty certain to be different from what you expected.)
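(For what it's worth, a quick sanity check of the clocks on both mirrors is something like the following, assuming ntpd is running on each host:)

ntpq -p   # the offset column should be within a few milliseconds on both servers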
Switching gears, what would the devs say 2.4.7 is capable of in operations per second?
I've recently run a 5GB database in back-hdb: 20,000 indexed searches/second concurrent with 13,000 modifies/second on an 8-core Opteron server (1.9GHz cores). This was tested using slamd and ~80 client threads, sustained over a 2-hour run.
I'm seeing a number of aborts when testing under high load. The latest came from running scripted ldapsearches and ldapmodifies which resulted in a mutex error (or so I am told by one of our developers).
I compiled with debugging on, which results in an abort with "connection.c: 676: connection_state_closing: Assertion 'c_struct_state == 0x02' failed" logged.
Interesting. It would be useful to get a gdb stack trace from that situation.
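For example, running slapd in the foreground under gdb and grabbing the trace when it aborts (paths and options per the local install; any -d level keeps slapd from detaching):

gdb /usr/local/libexec/slapd
(gdb) run -h 'ldap:/// ldaps:///' -d 7
... reproduce the abort ...
(gdb) bt
(gdb) thread apply all bt

A per-thread backtrace ('thread apply all bt') is often what is needed for threading/mutex problems.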
Howard Chu wrote:
OK, this sounds like the background thread to propagate updates isn't getting scheduled when it should. That could be a bug in the syncprov overlay.
Should I file a report, and if so, what information is required from this end?
Then that is what you should have asked. "Looking for clarification between the implementation of MirrorMode and Multi-Master" is a much clearer question than "Can I setup a two node N-Way", and there is no way one could logically get from the latter to the former, based on the context of your email. If you don't ask useful questions, you have only yourself to blame when you don't get useful answers.
I would disagree with the assumption that the thread itself did not provide enough context, but it is silly to belabor this point. Thank you for clearing that up for me.
I've recently run a 5GB database in back-hdb: 20,000 indexed searches/second concurrent with 13,000 modifies/second on an 8-core Opteron server (1.9GHz cores). This was tested using slamd and ~80 client threads, sustained over a 2-hour run.
Thanks for the info. That would certainly point to a problem with my build environment, or a new bug.
Interesting. It would be useful to get a gdb stack trace from that situation.
Yesterday I was able to successfully reproduce this behavior at least three times. This morning, I was able to reproduce it with the above steps yet again. From a gdb session, no backtrace was available, however. I then recompiled with debugging enabled and was unable to reproduce the bug until I added '-d 7' to the run arguments. It should be noted that before recompiling, I was able to reproduce the behavior with and without the command line debug argument.
Here's the stack trace from a gdb session with arguments -h 'ldap:/// ldaps:///' -d 7:
=> acl_string_expand: expanded: uid=[^,]+,ou=employees,ou=people,dc=example,dc=com
=> regex_matches: string: uid=syncrepl,ou=ldap,dc=example,dc=com
=> regex_matches: rc: 1 no matches
slapd: connection.c:676: connection_state_closing: Assertion `c->c_struct_state == 0x02' failed.
Program received signal SIGABRT, Aborted.
[Switching to Thread 1124096320 (LWP 7301)]
0x0000003918230055 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003918230055 in raise () from /lib64/libc.so.6
#1  0x0000003918231af0 in abort () from /lib64/libc.so.6
#2  0x0000003918229756 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000042d345 in connection_state_closing ()
#4  0x000000000043d43b in slap_freeself_cb ()
#5  0x000000000043ef81 in slap_send_search_entry ()
#6  0x00000000004c88c4 in syncprov_initialize ()
#7  0x00002aaaaaabc1c7 in ldap_int_thread_pool_wrapper (xpool=0x1a0b06d0) at tpool.c:625
#8  0x00000039196062f7 in start_thread () from /lib64/libpthread.so.0
#9  0x00000039182ce85d in clone () from /lib64/libc.so.6
(gdb)
Environment: CentOS 5, updated to whatever RH thinks is current as of last week; Oracle BDB 4.5.20 and OpenLDAP 2.4.7 compiled by hand.
Please let me know if any further information would be helpful.
-Nick
Nick Geron wrote:
Should I file a report, and if so, what information is required from this end?
I've filed this as ITS#5405 and fixed it in CVS HEAD. It would be good if you could test the patch and report your results as a followup to the ITS.
Your stack trace is a bit odd, because I can't find anywhere in the source tree that uses the function "slap_freeself_cb()". Are you using any custom overlays? It appears that your stack trace is still missing a lot of details. You should compile with -g and without any optimization, and make sure you're testing with the unstripped binary.
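Roughly along these lines from the source tree (a sketch only; keep whatever backend/module options you already configure with):

cd openldap-2.4.7
env CFLAGS="-g -O0" ./configure --enable-hdb --enable-overlays   # plus your usual options
make depend
make
make install
file /usr/local/libexec/slapd   # should now report "not stripped"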
Howard Chu wrote:
I've filed this as ITS#5405 and fixed it in CVS HEAD. It would be good if you could test the patch and report your results as a followup to the ITS.
I'll try to look into this soon.
Your stack trace is a bit odd, because I can't find anywhere in the source tree that uses the function "slap_freeself_cb()". Are you using any custom overlays? It appears that your stack trace is still missing a lot of details. You should compile with -g and without any optimization, and make sure you're testing with the unstripped binary.
The only overlays I'm using are those found in the openldap tarball. I found that function referenced in result.c
[root@server01 openldap-2.4.7]# grep -R slap_freeself_cb .
Binary file ./servers/slapd/.libs/slapd matches
./servers/slapd/proto-slap.h:LDAP_SLAPD_F (int) slap_freeself_cb LDAP_P(( Operation *op, SlapReply *rs ));
Binary file ./servers/slapd/result.o matches
./servers/slapd/result.c:int slap_freeself_cb( Operation *op, SlapReply *rs )
[root@server01 openldap-2.4.7]# strings /usr/local/libexec/slapd | grep freeself
slap_freeself_cb
slap_freeself_cb
[root@server01 openldap-2.4.7]#
It does look like I had stripped binaries though. I've removed optimization and verified that the -g option is being given. It took a bit to track down where stripping was being set, but 'file' now tells me the binaries are not stripped.
Here's the latest backtrace:
regex_matches: string: uid=syncrepl,ou=ldap,dc=corenap,dc=com
=> regex_matches: rc: 1 no matches
ldap_read: want=8, got=7
  0000:  30 05 02 01 03 42 00                               0....B.
ber_get_next: tag 0x30 len 5 contents:
ber_get_next
ldap_read: want=8, got=0
ber_get_next on fd 21 failed errno=0 (Success)
connection_closing: readying conn=1 sd=21 for close
connection_close: deferring conn=1 sd=21
conn=1 op=2 do_unbind
connection_resched: attempting closing conn=1 sd=21
connection_close: conn=1 sd=21
slapd: connection.c:676: connection_state_closing: Assertion `c->c_struct_state == 0x02' failed.
Program received signal SIGABRT, Aborted.
[Switching to Thread 1107310912 (LWP 17114)]
0x00000030ae430055 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00000030ae430055 in raise () from /lib64/libc.so.6
#1  0x00000030ae431af0 in abort () from /lib64/libc.so.6
#2  0x00000030ae429756 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000043854a in connection_state_closing (c=0xa35ef60) at connection.c:676
#4  0x000000000044dc38 in send_ldap_ber (conn=0xa35ef60, ber=0x42002410) at result.c:153
#5  0x000000000045173d in slap_send_search_entry (op=0x420029a0, rs=0x420026c0) at result.c:1187
#6  0x000000000051072f in syncprov_sendresp (op=0x420029a0, opc=0x42002780, so=0xa882b80, e=0x420027d8, mode=2) at syncprov.c:809
#7  0x0000000000510a8d in syncprov_qplay (op=0x420029a0, on=0xa094690, so=0xa882b80) at syncprov.c:878
#8  0x0000000000510c54 in syncprov_qtask (ctx=0x42002dc0, arg=0x2aaac4111500) at syncprov.c:923
#9  0x00002aaaaaabc605 in ldap_int_thread_pool_wrapper (xpool=0xa0466d0) at tpool.c:625
#10 0x00000030aec062f7 in start_thread () from /lib64/libpthread.so.0
#11 0x00000030ae4ce85d in clone () from /lib64/libc.so.6
(gdb)
-Nick
Nick Geron wrote:
The only overlays I'm using are those found in the openldap tarball. I found that function referenced in result.c
It is defined in result.c, but it is not referenced anywhere.
[latest backtrace snipped -- quoted in full above]
This trace looks the same as ITS#5401. Apparently syncprov is trying to send a persistent search response after the psearch consumer has closed the connection. Obviously this should never happen; when a connection is closed all the outstanding operations on it are abandoned.
You should probably send further replies to ITS#5401. Can you please provide, in addition to the above trace, the output of "frame 3" followed by "print *c"?
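That is, from the gdb session that produced the trace (or from the core file), something like:

(gdb) frame 3
(gdb) print *c

c_struct_state is the field the failing assertion at connection.c:676 is checking.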
Howard Chu wrote:
You should probably send further replies to ITS#5401. Can you please provide, in addition to the above trace, the output of "frame 3" followed by "print *c"?
=> regex_matches: string: uid=syncrepl,ou=ldap,dc=corenap,dc=com
=> regex_matches: rc: 1 no matches
slapd: connection.c:676: connection_state_closing: Assertion `c->c_struct_state == 0x02' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 1107310912 (LWP 30979)]
0x0000003918230055 in raise () from /lib64/libc.so.6
(gdb) frame 3
#3  0x000000000043854a in connection_state_closing (c=0xff53f20) at connection.c:676
676         assert( c->c_struct_state == SLAP_C_USED );
(gdb) print *c
$1 = {c_struct_state = 1, c_conn_state = 0, c_conn_idx = 20, c_sd = -1,
  c_close_reason = 0x545d7b "?",
  c_mutex = {__data = {__lock = 1, __count = 0, __owner = 30979, __nusers = 1, __kind = 0, __spins = 0,
      __list = {__prev = 0x0, __next = 0x0}},
    __size = "\001\000\000\000\000\000\000\000\003y\000\000\001", '\0' <repeats 26 times>, __align = 1},
  c_sb = 0x2aaac4000ef0, c_starttime = 0, c_activitytime = 0, c_connid = 18446744073709551615,
  c_peer_domain = {bv_len = 0, bv_val = 0x0}, c_peer_name = {bv_len = 0, bv_val = 0x0},
  c_listener = 0x0, c_sasl_bind_mech = {bv_len = 0, bv_val = 0x0}, c_sasl_dn = {bv_len = 0, bv_val = 0x0},
  c_sasl_authz_dn = {bv_len = 0, bv_val = 0x0}, c_authz_backend = 0x0, c_authz_cookie = 0x0,
  c_authz = {sai_method = 128, sai_mech = {bv_len = 0, bv_val = 0x0}, sai_dn = {bv_len = 0, bv_val = 0x0},
    sai_ndn = {bv_len = 0, bv_val = 0x0}, sai_ssf = 0, sai_transport_ssf = 0, sai_tls_ssf = 0,
    sai_sasl_ssf = 0},
  c_protocol = 0, c_ops = {stqh_first = 0x0, stqh_last = 0xff54038},
  c_pending_ops = {stqh_first = 0x0, stqh_last = 0xff54048},
  c_write_mutex = {__data = {__lock = 1, __count = 0, __owner = 30979, __nusers = 1, __kind = 0, __spins = 0,
      __list = {__prev = 0x0, __next = 0x0}},
    __size = "\001\000\000\000\000\000\000\000\003y\000\000\001", '\0' <repeats 26 times>, __align = 1},
  c_write_cv = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0,
      __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\0' <repeats 47 times>, __align = 0},
  c_currentber = 0x0, c_sasl_bind_in_progress = 0 '\0', c_writewaiter = 0 '\0', c_is_tls = 0 '\0',
  c_needs_tls_accept = 0 '\0', c_sasl_layers = 0 '\0', c_sasl_done = 0 '\0', c_sasl_authctx = 0x0,
  c_sasl_sockctx = 0x0, c_sasl_extra = 0x0, c_sasl_bindop = 0x0,
  c_pagedresults_state = {ps_be = 0x0, ps_size = 0, ps_count = 0, ps_cookie = 0,
    ps_cookieval = {bv_len = 0, bv_val = 0x0}},
  c_n_ops_received = 3, c_n_ops_executing = 0, c_n_ops_pending = 0, c_n_ops_completed = 3,
  c_n_get = 3, c_n_read = 3, c_n_write = 0, c_extensions = 0x0, c_clientfunc = 0, c_clientarg = 0x0,
  c_send_ldap_result = 0x44ebfc <slap_send_ldap_result>,
  c_send_search_entry = 0x44f994 <slap_send_search_entry>,
  c_send_search_reference = 0x451ad1 <slap_send_search_reference>,
  c_send_ldap_extended = 0x44f4c8 <slap_send_ldap_extended>,
  c_send_ldap_intermediate = 0x44f771 <slap_send_ldap_intermediate>}
(gdb) bt
#0  0x0000003918230055 in raise () from /lib64/libc.so.6
#1  0x0000003918231af0 in abort () from /lib64/libc.so.6
#2  0x0000003918229756 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000043854a in connection_state_closing (c=0xff53f20) at connection.c:676
#4  0x000000000044dc38 in send_ldap_ber (conn=0xff53f20, ber=0x42002410) at result.c:153
#5  0x000000000045173d in slap_send_search_entry (op=0x420029a0, rs=0x420026c0) at result.c:1187
#6  0x000000000051072f in syncprov_sendresp (op=0x420029a0, opc=0x42002780, so=0x101f9a50, e=0x420027d8, mode=2) at syncprov.c:809
#7  0x0000000000510a8d in syncprov_qplay (op=0x420029a0, on=0xfc89690, so=0x101f9a50) at syncprov.c:878
#8  0x0000000000510c54 in syncprov_qtask (ctx=0x42002dc0, arg=0x105542c0) at syncprov.c:923
#9  0x00002aaaaaabc605 in ldap_int_thread_pool_wrapper (xpool=0xfc3b6d0) at tpool.c:625
#10 0x00000039196062f7 in start_thread () from /lib64/libpthread.so.0
#11 0x00000039182ce85d in clone () from /lib64/libc.so.6
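(If it helps the ITS, the state of the other threads can be captured from the same session as well, e.g.:)

(gdb) info threads
(gdb) thread apply all bt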