Hi
I have a master/replica setup that has been working for years. Recently, I discovered that a replica sometimes failed to return existing entries: no error, it just returned nentries=0 for a few queries, then went back to normal.
I suspected some database corruption: I killed slapd, removed the databases, and restarted slapd so that it would resync from the master. It pulled a few hundred records and then got an up-to-date contextCSN while a lot of entries were still missing. Restarting slapd exhibits a straight failure in syncrepl:
Sep 28 06:34:29 motul slapd[2901]: do_syncrepl: rid=217 rc -2 retrying
How do I debug that? I had the problem with OpenLDAP-2.4.21/ db-4.7.25.3. I tried upgrading the replica to OpenLDAP-2.4.32 / db-4.8.30 but it did not change anything. Is there a chance that upgrading the master (runs 2.4.21 too) will help?
During the short time syncrepl runs, the logs fill up with messages like the one below. I wonder whether this is a related problem:
Sep 28 03:30:16 motul slapd[269]: conn=-1 op=0 => bdb_dn2id_add dn="uid=user,ou=foo,dc=example,dc=net" ID=0x7c: put failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock -30994
On Fri, Sep 28, 2012 at 06:53:12AM +0200, Emmanuel Dreyfus wrote:
> Sep 28 06:34:29 motul slapd[2901]: do_syncrepl: rid=217 rc -2 retrying
> How do I debug that? I had the problem with OpenLDAP-2.4.21/ db-4.7.25.3. I tried upgrading the replica to OpenLDAP-2.4.32 / db-4.8.30 but it did not change anything. Is there a chance that upgrading the master (runs 2.4.21 too) will help?
I did try upgrading the master; it fails the same way, with a different error message:
Sep 28 11:04:14 motul slapd[5783]: do_syncrepl: rid=217 rc 32 retrying
On Fri, Sep 28, 2012 at 09:10:52AM +0000, Emmanuel Dreyfus wrote:
> I did try upgrading the master, it fails the same way, with a different error message
> Sep 28 11:04:14 motul slapd[5783]: do_syncrepl: rid=217 rc 32 retrying
I found the problem. I had this in slapd.conf:

database bdb
suffix "ou=foo,dc=example,dc=net"
(...)
syncrepl rid=317
  searchbase="ou=foo,dc=example,dc=net"
  (...)

database bdb
suffix "dc=example,dc=net"
(...)
syncrepl rid=217
  searchbase="dc=example,dc=net"
  (...)
syncrepl rid=317
  searchbase="ou=foo,dc=example,dc=net"
  (...)
Removing the duplicated syncrepl rid=317 fixed the problem. It would be nice if slapd could refuse to start in such conditions.
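Until slapd performs such a check itself, a small external pre-flight script can catch this kind of mistake. The following is only a sketch, not part of OpenLDAP: the function names (logical_lines, duplicate_rids) are made up for illustration, and it assumes a slapd.conf-style file where continuation lines start with whitespace and the rid is written as rid=NNN without quotes or spaces. It reports every rid that appears in more than one syncrepl directive, along with the line numbers where those directives start.

```python
import re
from collections import defaultdict

# Assumption: the rid is written as rid=NNN (no quotes, no spaces around "=").
RID = re.compile(r"\brid=(\d+)")


def logical_lines(text):
    """Yield (start_line_number, joined_text) for each logical directive.

    Lines beginning with whitespace are treated as continuations of the
    previous directive; lines whose first non-blank character is '#' are
    skipped as comments.
    """
    current, start = None, 0
    for number, raw in enumerate(text.splitlines(), start=1):
        if raw.lstrip().startswith("#"):
            continue
        if raw[:1] in (" ", "\t") and current is not None:
            current += " " + raw.strip()
            continue
        if current is not None:
            yield start, current
        stripped = raw.strip()
        current, start = (stripped, number) if stripped else (None, 0)
    if current is not None:
        yield start, current


def duplicate_rids(text):
    """Return {rid: [line numbers]} for rids used by more than one syncrepl."""
    seen = defaultdict(list)
    for number, line in logical_lines(text):
        if line.split(None, 1)[0].lower() == "syncrepl":
            match = RID.search(line)
            if match:
                seen[int(match.group(1))].append(number)
    return {rid: lines for rid, lines in seen.items() if len(lines) > 1}


# Sample modelled on the configuration quoted in this thread
# (the elided "(...)" parts are simply left out).
SAMPLE = """\
database bdb
suffix "ou=foo,dc=example,dc=net"
syncrepl rid=317
  searchbase="ou=foo,dc=example,dc=net"

database bdb
suffix "dc=example,dc=net"
syncrepl rid=217
  searchbase="dc=example,dc=net"
syncrepl rid=317
  searchbase="ou=foo,dc=example,dc=net"
"""

for rid, lines in duplicate_rids(SAMPLE).items():
    print(f"rid={rid} is used by syncrepl directives starting at lines {lines}")
```

To check a real configuration, read the file contents and pass them to duplicate_rids() instead of SAMPLE. The script checks the whole file rather than each database section separately, since the duplicate reported here sat in a different database section than the original.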
openldap-technical@openldap.org