Full_Name: Bill MacAllister Version: 2.4.23-1 OS: Debian Lenny URL: http://www.stanford.edu/~whm/files/ldap-replication-error.txt Submission from: (NULL) (99.55.172.208)
I just replaced a 2.3.43 OpenLDAP master with a 2.4.23-1 master. There are 6 slaves all running 2.4.23-1. After reloading the master database, I confirmed that replication was working my making some changes and then examining contextCSNs. Sometime later replication stopped on one of the servers. Looking at the log on the failing slave I saw:
Sep 2 22:12:53 directory3 slapd[7005]: do_syncrep2: rid=000 LDAP_RES_INTERMEDIATE - SYNC_ID_SET Sep 2 22:12:53 directory3 slapd[7005]: do_syncrep2: rid=000 (-1) Can't contact LDAP server Sep 2 22:12:53 directory3 slapd[7005]: do_syncrepl: rid=000 rc -1 retrying
On the master I saw:
Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 fd=38 ACCEPT from IP=171.67.218.156:44519 (IP=0.0.0.0:389) Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=0 BIND dn="" method=163 Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=0 RESULT tag=97 err=14 text=SASL(0): successful result: Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=1 BIND dn="" method=163 Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=1 RESULT tag=97 err=14 text=SASL(0): successful result: Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND dn="" method=163 Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND authcid="ldap/directory3.stanford.edu@stanford.edu" authzid="ldap/directory3.stanford.edu@stanford.edu" Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND dn="cn=directory3,cn=ldap,cn=operational,dc=stanford,dc=edu" mech=GSSAPI sasl_ssf=56 ssf=56 Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 RESULT tag=97 err=0 text= Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=3 SRCH base="cn=accesslog" scope=2 deref=0 filter="(&(objectClass=auditWriteObject)(reqResult=0))" Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=3 SRCH attr=reqDN reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 fd=38 closed (connection lost on write)
Running slapd on the failing slave with -d -1 I saw:
=>do_syncrep2 rid=000 ldap_result ld 0x1c07000 msgid 4 wait4msg ld 0x1c07000 msgid 4 (timeout 0 usec) wait4msg continue ld 0x1c07000 msgid 4 all 0 ** ld 0x1c07000 Connections: * host: directory-master.stanford.edu port: 389 (default) refcnt: 2 status: Connected last used: Thu Sep 2 22:12:53 2010
** ld 0x1c07000 Outstanding Requests: * msgid 4, origid 4, status InProgress outstanding referrals 0, parent count 0 ld 0x1c07000 request count 1 (abandoned 0) ** ld 0x1c07000 Response Queue: Empty ld 0x1c07000 response count 0 ldap_chkResponseList ld 0x1c07000 msgid 4 all 0 ldap_chkResponseList returns ld 0x1c07000 NULL ldap_int_select read1msg: ld 0x1c07000 msgid 4 all 0 ber_get_next sasl_generic_read: want=4, got=4 0000: 00 00 00 78 ...x sasl_generic_read: want=120, got=120
... lots of output ...
sasl_generic_read: want=45833, got=0
ldap_read: want=8, got=0
ber_get_next failed. ldap_err2string do_syncrep2: rid=000 (-1) Can't contact LDAP server
I am including a pointer to the complete debug log.
Looks like something in the sasl support is unhappy. We are using libsasl2-modules-gssapi-heimdal 2.1.23.dfsg1-5~sbp50+2.
Strange to me that this one server is having a problem. There are actually 5 slaves that are working just fine. Here are current contextCSNs for the systems.
directory-master contextCSN: 20100903060040.713677Z#000000#000#000000 directory1 contextCSN: 20100903060040.713677Z#000000#000#000000 directory2 contextCSN: 20100903060040.713677Z#000000#000#000000 directory3 contextCSN: 20100903015531.064407Z#000000#000#000000 directory4 contextCSN: 20100903060040.713677Z#000000#000#000000 directory11 contextCSN: 20100903060040.713677Z#000000#000#000000 directory12 contextCSN: 20100903060040.713677Z#000000#000#000000
What else should I look at?
Bill