Hi,
I am running a 4-way multi-master configuration with a number of slaves in remote locations. I am currently running openldap 2.4.33 on top of CentOS 6.3 (I built 2.4.33 from a modified base centos 6 spec file). I was originally running the centos base openldap 2.4.23 in N-way multimaster mode with the syncrepl configuration, but I was having problems keeping the masters and slaves in perfect sync. Other than that, 2.4.23 had been running stably since last spring. I'll try to be brief about what has happened since Feb 1.
* I upgraded the 4 masters to 2.4.33 and kept the syncrepl configuration. The syncrepl masters were using RefreshAndPersist while the slave consumers were using RefreshOnly.
* After the upgrade the 2.4.33 masters began locking up: not refusing connections, but not answering queries. This would happen 3-4 times per day. When one master locked up, all the masters would lock up. The slaves did not appear to be affected.
* I downgraded back to 2.4.23 on all of the masters, only to have the lock-ups continue.
* I slapcat'ed the database on one master, blew away the databases on all the other masters and slaves, and rebuilt everything. I rebuilt one master and one slave and rsync'ed the slapd.d directory where needed. Then I started each master one by one to validate that they mirrored the databases correctly, and repeated this on the slaves. Unfortunately the masters continued to lock up as above.
* Seeing that the lock-ups occurred regardless of the openldap version, I decided to go back to 2.4.33 and make the move to delta-replication.
* This past weekend I finally got delta-replication working. I did the slapcat, rebuild slapd.d, slapadd sequence on one master and rsync'ed slapd.d to each master one at a time. All was well and all databases were in perfect sync.
* Unfortunately the masters continued to lock up, accepting connections but never servicing requests, so all queries would hang.
Looking at this again today I noticed that my masters were all running at near 100% CPU but continuing to service queries. Depending on the number of CPUs, only one or two threads would be running this high. Running strace -tt -p <pid-of-thread> against such a thread, this is what would be spewing out:
18:52:05.713266 sched_yield() = 0
18:52:05.713323 sched_yield() = 0
18:52:05.713380 sched_yield() = 0
18:52:05.713438 sched_yield() = 0
18:52:05.713495 sched_yield() = 0
18:52:05.713553 sched_yield() = 0
18:52:05.713611 sched_yield() = 0
18:52:05.713668 sched_yield() = 0
18:52:05.713726 sched_yield() = 0
18:52:05.713783 sched_yield() = 0
18:52:05.713840 sched_yield() = 0
18:52:05.713898 sched_yield() = 0
I haven't correlated this to the slapd daemons hanging, yet.
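To put a number on how tight that spin is, the gaps between consecutive sched_yield() calls can be computed from the strace -tt timestamps. The following is only a minimal sketch in Python; the function name yield_intervals is made up for illustration, and it assumes one syscall per line in the "HH:MM:SS.micros syscall(...)" layout shown above. On the sample above the gaps come out at roughly 57 to 58 microseconds, i.e. the thread yields and is rescheduled immediately.

```python
from datetime import datetime


def yield_intervals(lines):
    """Return the gaps (in seconds) between consecutive sched_yield() calls.

    Assumes `strace -tt` output, one syscall per line, e.g.
    "18:52:05.713266 sched_yield() = 0". Other syscalls are ignored.
    """
    stamps = []
    for line in lines:
        parts = line.split(None, 1)  # timestamp, rest of the line
        if len(parts) == 2 and parts[1].startswith("sched_yield("):
            stamps.append(datetime.strptime(parts[0], "%H:%M:%S.%f"))
    # Pair each timestamp with its successor and take the difference.
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Feeding it the lines of a saved strace capture (for example via open(path).read().splitlines()) gives a list of gaps whose average indicates how often the thread is spinning.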
There is nothing interesting in the logs when the slapd daemons would hang. Again when one master hangs they all would hang. I would restart each master one by one and on occasions when one master restarted the others would start servicing again. Other times it would take two or three restarts to get all of the masters servicing again. The only gain with delta-replication is that they only hang once a day now and usually after I had gone home.
For now I have implemented a small script, run from cron every two minutes on all four masters, that tests whether the slapd daemon is hung by doing a simple ldapsearch and, if so, restarts it. My database is not large at all, with only ~100 users, but it is critical as it is the backend authentication for everything, including remote access.
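A watchdog of this kind could look roughly like the sketch below. This is not the actual script from the post, only an illustration in Python: the function names, the 10 second timeout, the root DSE probe, and the "service slapd restart" command are all assumptions. The probe uses -ZZ because the configuration requires TLS (olcSecurity: tls=1), and it relies on a hard process timeout because a hung slapd accepts the connection but never answers.

```python
import subprocess


def build_probe_command(uri):
    """ldapsearch against the root DSE: anonymous simple bind, StartTLS required."""
    return ["ldapsearch", "-x", "-ZZ", "-H", uri,
            "-s", "base", "-b", "", "-LLL",
            "(objectClass=*)", "namingContexts"]


def slapd_is_responsive(uri, timeout_s=10, run=subprocess.run):
    """Return True if the probe search completes successfully within timeout_s."""
    try:
        result = run(build_probe_command(uri),
                     stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL,
                     timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Connection accepted but no answer: the hang described above.
        return False
    # Any other failure (refused connection, TLS error) also counts as unhealthy.
    return result.returncode == 0


def check_and_restart(uri, restart_cmd=("service", "slapd", "restart"),
                      run=subprocess.run):
    """Restart slapd if the probe fails; return True when a restart was issued."""
    if slapd_is_responsive(uri, run=run):
        return False
    run(list(restart_cmd), check=False)  # restart command is an assumption
    return True
```

From cron, a two-line wrapper that calls check_and_restart("ldap://localhost") every two minutes would match the behaviour described. The run parameter exists only so the logic can be exercised without touching a real server.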
Here is the slapcat of my cn=config database (minus the schemas and operational attributes). It is a fairly typical delta-replication configuration. The accesslog database uses hdb as that is what most (all) of the accesslog examples show. The main database is bdb.
Any suggestions would be greatly appreciated.
Regards, Bob --bs
dn: cn=config
objectClass: olcGlobal
cn: config
olcConfigFile: slapd.conf
olcConfigDir: slapd.d
olcArgsFile: /var/run/openldap/slapd.args
olcAttributeOptions: lang-
olcAuthzPolicy: none
olcConcurrency: 0
olcConnMaxPendingAuth: 1000
olcGentleHUP: FALSE
olcIdleTimeout: 0
olcIndexSubstrIfMaxLen: 4
olcIndexSubstrIfMinLen: 2
olcIndexSubstrAnyLen: 4
olcIndexSubstrAnyStep: 2
olcIndexIntLen: 4
olcLocalSSF: 71
olcPidFile: /var/run/openldap/slapd.pid
olcReadOnly: FALSE
olcSaslSecProps: noplain,noanonymous
olcSecurity: tls=1
olcServerID: 1 ldap://auth1noc.man.o3b.local
olcServerID: 2 ldap://auth2noc.man.o3b.local
olcServerID: 3 ldap://auth1noc.btz.o3b.local
olcServerID: 4 ldap://auth2noc.btz.o3b.local
olcServerID: 5 ldap://auth1gw.nma.o3b.local
olcServerID: 6 ldap://auth2gw.nma.o3b.local
olcServerID: 7 ldap://auth1gw.sun.o3b.local
olcServerID: 8 ldap://auth2gw.sun.o3b.local
olcServerID: 9 ldap://auth1gw.per.o3b.local
olcServerID: 10 ldap://auth2gw.per.o3b.local
olcSockbufMaxIncoming: 262143
olcSockbufMaxIncomingAuth: 16777215
olcThreads: 16
olcTLSCipherSuite: HIGH:MEDIUM:SSLv2
olcTLSCertificateFile: /etc/openldap/cacerts/auth-o3b.crt
olcTLSCertificateKeyFile: /etc/openldap/cacerts/auth-o3b.key
olcTLSCRLCheck: none
olcToolThreads: 1
olcWriteTimeout: 0
olcTLSCACertificateFile: /etc/pki/tls/certs/o3b-master-ca.crt
olcTLSVerifyClient: never
olcLogLevel: sync
olcConnMaxPending: 101

dn: cn=module{0},cn=config
objectClass: olcModuleList
cn: module{0}
olcModulePath: /usr/lib64/openldap
olcModuleLoad: {0}syncprov.la
olcModuleLoad: {1}memberof.la
olcModuleLoad: {2}ppolicy.la
olcModuleLoad: {3}accesslog.la

dn: olcDatabase={-1}frontend,cn=config
objectClass: olcDatabaseConfig
objectClass: olcFrontendConfig
olcDatabase: {-1}frontend
olcAccess: {0}to dn.base="" by * read
olcAccess: {1}to dn.subtree="cn=monitor"
  by dn.base="cn=rootdn,dc=o3bnetworks.net" read
olcAccess: {2}to dn.base="cn=subschema" by * read
olcAddContentAcl: FALSE
olcLastMod: TRUE
olcMaxDerefDepth: 0
olcReadOnly: FALSE
olcSchemaDN: cn=Subschema
olcSecurity: tls=1
olcMonitoring: FALSE
olcPasswordHash: {SSHA}

dn: olcDatabase={0}config,cn=config
objectClass: olcDatabaseConfig
olcDatabase: {0}config
olcAccess: {0}to *
  by dn.base="cn=rootdn,dc=o3bnetworks.net" write
  by dn.base="cn=syncdn,dc=o3bnetworks.net" read
  by * none
olcAddContentAcl: TRUE
olcLastMod: TRUE
olcLimits: {0}dn.base="cn=rootdn,dc=o3bnetworks.net" size=unlimited time=unlimited
olcLimits: {1}dn.base="cn=syncdn,dc=o3bnetworks.net" size=unlimited time=unlimited
olcMaxDerefDepth: 15
olcReadOnly: FALSE
olcRootDN: cn=config
olcMirrorMode: TRUE
olcMonitoring: FALSE
olcRootPW:: ***
olcSyncrepl: {0}rid=001
  provider=ldap://auth1noc.man.o3b.local
  bindmethod=simple
  binddn="cn=syncdn,dc=o3bnetworks.net"
  credentials="33jJ9nSkSD"
  keepalive=0:5:0
  starttls=yes
  tls_reqcert=allow
  tls_cipher_suite=HIGH:MEDIUM:SSLv2
  searchbase="cn=config"
  scope=sub
  schemachecking=off
  type=refreshAndPersist
  retry="5 5 300 +"
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog
olcSyncrepl: {1}rid=002
  provider=ldap://auth2noc.man.o3b.local
  bindmethod=simple
  binddn="cn=syncdn,dc=o3bnetworks.net"
  credentials="33jJ9nSkSD"
  keepalive=0:5:0
  starttls=yes
  tls_reqcert=allow
  tls_cipher_suite=HIGH:MEDIUM:SSLv2
  searchbase="cn=config"
  scope=sub
  schemachecking=off
  type=refreshAndPersist
  retry="5 5 300 +"
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog
olcSyncrepl: {2}rid=003
  provider=ldap://auth1noc.btz.o3b.local
  bindmethod=simple
  binddn="cn=syncdn,dc=o3bnetworks.net"
  credentials="33jJ9nSkSD"
  keepalive=0:5:0
  starttls=yes
  tls_reqcert=allow
  tls_cipher_suite=HIGH:MEDIUM:SSLv2
  searchbase="cn=config"
  scope=sub
  schemachecking=off
  type=refreshAndPersist
  retry="5 5 300 +"
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog
olcSyncrepl: {3}rid=004
  provider=ldap://auth2noc.btz.o3b.local
  bindmethod=simple
  binddn="cn=syncdn,dc=o3bnetworks.net"
  credentials="33jJ9nSkSD"
  keepalive=0:5:0
  starttls=yes
  tls_reqcert=allow
  tls_cipher_suite=HIGH:MEDIUM:SSLv2
  searchbase="cn=config"
  scope=sub
  schemachecking=off
  type=refreshAndPersist
  retry="5 5 300 +"
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog

dn: olcOverlay={0}syncprov,olcDatabase={0}config,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: {1}syncprov
olcSpCheckpoint: 1000 60

dn: olcOverlay={1}accesslog,olcDatabase={0}config,cn=config
objectClass: olcOverlayConfig
objectClass: olcAccessLogConfig
olcOverlay: {1}accesslog
olcAccessLogDB: cn=accesslog
olcAccessLogOps: writes
olcAccessLogSuccess: TRUE
olcAccessLogPurge: 2+00:00 1+00:00

dn: olcDatabase={1}hdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcConfig
objectClass: top
objectClass: olcHdbConfig
olcDbDirectory: /var/lib/ldap/accesslog
olcSuffix: cn=accesslog
olcDbConfig: [Deleted] aXIgLXEgb3B0aW9uKS4g
olcAddContentAcl: FALSE
olcDbCacheFree: 1
olcDbCacheSize: 1000
olcAccess: {0}to *
  by self write
  by dn.base="cn=rootdn,dc=o3bnetworks.net" read
  by dn.base="cn=authdn,dc=o3bnetworks.net" read
  by dn.base="cn=syncdn,dc=o3bnetworks.net" read
olcDbDirtyRead: FALSE
olcDbIDLcacheSize: 0
olcDbDNcacheSize: 0
olcDbIndex: default eq
olcMaxDerefDepth: 15
olcLimits: {0}dn.base="cn=syncdn,dc=o3bnetworks.net" size=unlimited time=unlimited
olcDbSearchStack: 16
olcLastMod: TRUE
olcDbLinearIndex: FALSE
olcDbMode: 0600
olcDbNoSync: FALSE
olcDbShmKey: 0
olcReadOnly: FALSE
olcSecurity: tls=1
olcRootDN: cn=accesslogdn
olcDatabase: {1}hdb

dn: olcOverlay={0}syncprov,olcDatabase={1}hdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: {0}syncprov
olcSpNoPresent: TRUE
olcSpReloadHint: TRUE

dn: olcDatabase={3}monitor,cn=config
objectClass: olcDatabaseConfig
olcAddContentAcl: FALSE
olcLastMod: TRUE
olcMaxDerefDepth: 15
olcReadOnly: FALSE
olcRootDN: cn=monitor,cn=Monitor
olcRootPW:: bW9uaXRvcg==
olcSecurity: tls=1
olcMonitoring: FALSE
olcDatabase: {3}monitor

dn: olcDatabase={3}bdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcBdbConfig
olcSuffix: dc=o3bnetworks.net
olcAddContentAcl: FALSE
olcLastMod: TRUE
olcLimits: {0}dn.base="cn=syncdn,dc=o3bnetworks.net" size=unlimited time=unlimited
olcMaxDerefDepth: 15
olcReadOnly: FALSE
olcRootDN: cn=rootdn,dc=o3bnetworks.net
olcRootPW:: ***
olcSecurity: tls=1
olcMirrorMode: TRUE
olcMonitoring: TRUE
olcDbDirectory: /var/lib/ldap
olcDbConfig: [Deleted]
olcDbNoSync: FALSE
olcDbDirtyRead: FALSE
olcDbIDLcacheSize: 0
olcDbIndex: objectClass pres,eq
olcDbIndex: cn pres,eq,sub
olcDbIndex: uid pres,eq,sub
olcDbIndex: uidNumber pres,eq
olcDbIndex: gidNumber pres,eq
olcDbIndex: memberUid pres,eq,sub
olcDbIndex: displayName pres,eq,sub
olcDbIndex: sambaSID pres,eq,sub
olcDbIndex: sambaDomainName pres,eq
olcDbIndex: sambaGroupType pres,eq
olcDbIndex: ou pres,eq,sub
olcDbIndex: sambaSIDList pres,eq
olcDbLinearIndex: FALSE
olcDbMode: 0600
olcDbSearchStack: 16
olcDbShmKey: 0
olcDbCacheFree: 1
olcDbDNcacheSize: 0
olcAccess: {0}to *
  by self write
  by group/groupOfNames/member.exact="cn=ldap admins,dc=o3bnetworks.net" write
  by dn.base="cn=authdn,dc=o3bnetworks.net" read
  by dn.base="cn=syncdn,dc=o3bnetworks.net" read
  by users read
  by anonymous read
olcDbCacheSize: 1000
olcDatabase: {3}bdb
olcSyncrepl: {0}rid=011
  provider=ldap://auth1noc.man.o3b.local
  bindmethod=simple
  binddn="cn=syncdn,dc=o3bnetworks.net"
  credentials="33jJ9nSkSD"
  keepalive=0:5:0
  starttls=yes
  tls_reqcert=allow
  tls_cipher_suite=HIGH:MEDIUM:SSLv2
  searchbase="dc=o3bnetworks.net"
  scope=sub
  schemachecking=off
  type=refreshAndPersist
  retry="5 5 300 +"
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog
olcSyncrepl: {1}rid=012
  provider=ldap://auth2noc.man.o3b.local
  bindmethod=simple
  binddn="cn=syncdn,dc=o3bnetworks.net"
  credentials="33jJ9nSkSD"
  keepalive=0:5:0
  starttls=yes
  tls_reqcert=allow
  tls_cipher_suite=HIGH:MEDIUM:SSLv2
  searchbase="dc=o3bnetworks.net"
  scope=sub
  schemachecking=off
  type=refreshAndPersist
  retry="5 5 300 +"
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog
olcSyncrepl: {2}rid=013
  provider=ldap://auth1noc.btz.o3b.local
  bindmethod=simple
  binddn="cn=syncdn,dc=o3bnetworks.net"
  credentials="33jJ9nSkSD"
  keepalive=0:5:0
  starttls=yes
  tls_reqcert=allow
  tls_cipher_suite=HIGH:MEDIUM:SSLv2
  filter="(objectclass=*)"
  searchbase="dc=o3bnetworks.net"
  scope=sub
  schemachecking=off
  type=refreshAndPersist
  retry="5 5 300 +"
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog
olcSyncrepl: {3}rid=014
  provider=ldap://auth2noc.btz.o3b.local
  bindmethod=simple
  binddn="cn=syncdn,dc=o3bnetworks.net"
  credentials="33jJ9nSkSD"
  keepalive=0:5:0
  starttls=yes
  tls_reqcert=allow
  tls_cipher_suite=HIGH:MEDIUM:SSLv2
  filter="(objectclass=*)"
  searchbase="dc=o3bnetworks.net"
  scope=sub
  schemachecking=off
  type=refreshAndPersist
  retry="5 5 300 +"
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog

dn: olcOverlay={0}memberof,olcDatabase={3}bdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcMemberOf
olcOverlay: {0}memberof
olcMemberOfDangling: ignore
olcMemberOfRefInt: FALSE

dn: olcOverlay={1}syncprov,olcDatabase={3}bdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: {1}syncprov
olcSpCheckpoint: 1000 60

dn: olcOverlay={2}ppolicy,olcDatabase={3}bdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcConfig
objectClass: top
objectClass: olcPPolicyConfig
olcOverlay: {2}ppolicy
olcPPolicyDefault: cn=O3b,ou=Password,ou=Policy,dc=o3bnetworks.net

dn: olcOverlay={3}accesslog,olcDatabase={3}bdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcAccessLogConfig
olcOverlay: {3}accesslog
olcAccessLogOps: writes
olcAccessLogSuccess: TRUE
olcAccessLogDB: cn=accesslog
olcAccessLogPurge: 2+00:00 1+00:00
--On Thursday, February 21, 2013 8:35 AM -0500 "Robert W. Smith" bob.smith@o3bnetworks.com wrote:
Hi,
I am running a 4-way multi-master configuration with a number of slaves in remote locations. I am currently running openldap 2.4.33 on top of CentOS 6.3 (I built 2.4.33 from a modified base centos 6 spec file). I was originally running the centos base openldap 2.4.23 using N-way multimaster using the syncrepl configuration but I was having problems with the masters and slaves staying in perfect sync--other than this 2.4.23 was running stably since last spring. I'll try to be brief in what has happened since Feb 1.
Did you build openldap with debugging symbols? (-g) Did you disable optimization? (-O0)
If so, I would advise submitting an ITS, with a full backtrace from gdb:
http://www.openldap.org/faq/data/cache/59.html
--Quanah
--
Quanah Gibson-Mount Sr. Member of Technical Staff Zimbra, Inc A Division of VMware, Inc. -------------------- Zimbra :: the leader in open source messaging and collaboration
--On Thursday, February 21, 2013 9:27 AM -0800 Quanah Gibson-Mount quanah@zimbra.com wrote:
Oh, also make sure when make install is run, you don't strip the binaries of the debug symbols (make install STRIP="")
I would also warn against using the centos spec file, as it links to NSS instead of OpenSSL. You probably want to look at http://ltb-project.org/wiki/download#openldap
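For the backtrace itself, one way to capture it from a running (or hung) slapd is to attach gdb in batch mode and dump every thread. The snippet below is only a sketch in Python; the helper name gdb_backtrace_command is made up for illustration, and it assumes gdb is on the PATH and that slapd was built with -g -O0 and installed unstripped as described above. The gdb options used (-p, -batch, -ex) and the "thread apply all bt full" command are standard gdb.

```python
def gdb_backtrace_command(pid, gdb="gdb"):
    """Build an argv that attaches to `pid`, prints a full backtrace of
    every thread, then detaches and exits (batch mode)."""
    return [gdb, "-p", str(pid), "-batch",
            "-ex", "thread apply all bt full"]
```

The resulting list can be passed to subprocess.run with stdout redirected to a file, and that file attached to the ITS. Attaching pauses slapd while the backtrace is taken, so on a production master it is best done when the daemon is already hung.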
--Quanah
Thanks, Quanah,
I'm in the process of building against the latest git pull I did today. I'll take your notes into account. I also found some references regarding epoll() versus select() (ref http://www.openldap.org/lists/openldap-devel/201212/msg00014.html ) and am trying to figure out whether this is an avenue to pursue in a test build.
Bob --bs
Quanah,
I re-compiled 2.4.33 using the openssl libraries on Friday and have installed the updated packages on my four master servers. I have not experienced a single deadlock since then. The slaves have not been upgraded yet--they are still running the centos stock 2.4.23 distribution.
The only issue that I still have is one or more threads running at full throttle and utilizing 100% CPU. But this is a different issue and I'll start a new thread on this when I get a chance to look at this more.
Thanks, Bob --bs
openldap-technical@openldap.org