Thanks for the responses. To sum up: after a few more "DN index delete failed" messages on my LDAP master server, I slapcat'ed, deleted, and slapadd'ed my databases, and everything seems fine again.
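For anyone wanting to do the same, this is roughly the rebuild procedure I used. It's only a sketch: the service name and the database directory are what a typical RHEL-style install uses, and may differ on your system (check your slapd.conf for the actual directory).

```shell
# Dump, wipe, and reload the database so all indexes are rebuilt.
# Assumes the RHEL init script name "ldap" and the default BDB
# directory /var/lib/ldap -- adjust for your install.
service ldap stop                                 # stop slapd first
slapcat -l backup.ldif                            # dump the database to LDIF
mv /var/lib/ldap /var/lib/ldap.broken             # set the old DB files aside
mkdir /var/lib/ldap
cp /var/lib/ldap.broken/DB_CONFIG /var/lib/ldap/  # keep the BDB tuning file
slapadd -l backup.ldif                            # reload; indexes are rebuilt
chown -R ldap:ldap /var/lib/ldap                  # slapd runs as the ldap user
service ldap start
```

The slapadd step is what actually regenerates the corrupted indexes; keeping DB_CONFIG preserves your Berkeley DB cache/log tuning.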
The responses I received were all generally trying to figure out what I might have done to cause the problem, be it errors during the rpm upgrade, running database maintenance programs as the wrong user, or some other event.
In fact, there was an event a few days before the errors that may have caused the problem. On one of my slaves, a full disk caused replication to fail. The slave's logs began like this:
Jul 22 08:10:01 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 6878: No space left on device
Jul 22 08:10:01 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 6878: No space left on device
Jul 22 08:10:01 dyaus slapd2.3[2132]: null_callback: error code 0x50
Jul 22 08:10:01 dyaus slapd2.3[2132]: syncrepl_entry: rid 001 be_modify failed (80)
Jul 22 08:10:01 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying (1 retries left)
Jul 22 08:10:02 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 7034: No space left on device
Jul 22 08:10:02 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 7034: No space left on device
Jul 22 08:10:02 dyaus slapd2.3[2132]: null_callback: error code 0x50
Jul 22 08:10:02 dyaus slapd2.3[2132]: syncrepl_updateCookie: rid 001 be_modify failed (80)
Jul 22 08:30:35 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 30286: No space left on device
Jul 22 08:30:35 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 30286: No space left on device
Jul 22 08:30:35 dyaus slapd2.3[2132]: null_callback: error code 0x50
Jul 22 08:30:35 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying (1 retries left)
On the master at the corresponding time, I have:
Jul 22 08:10:01 janus slapd[2049]: conn=31060 fd=17 ACCEPT from IP=129.180.1.107:33547 (IP=0.0.0.0:389)
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=0 BIND dn="cn=syncuser,dc=une,dc=edu,dc=au" method=128
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=0 BIND dn="cn=SyncUser,dc=une,dc=edu,dc=au" mech=SIMPLE ssf=0
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=0 RESULT tag=97 err=0 text=
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=1 SRCH base="dc=une,dc=edu,dc=au" scope=2 deref=0 filter="(objectClass=*)"
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=1 SRCH attr=* structuralObjectClass entryCSN
Then, after a long delay on this connection:
Jul 22 08:30:35 janus slapd[2049]: conn=31060 op=2 UNBIND
Jul 22 08:30:35 janus slapd[2049]: conn=31060 fd=17 closed
Then on the master, the logs started to look like this. Note that there were no disk problems on the master.
Jul 22 09:39:40 janus slapd[2049]: send_search_entry: conn 33464 ber write failed.
Jul 22 09:39:40 janus slapd[2049]: conn=33464 fd=17 closed (connection lost on write)
Jul 22 09:39:40 janus slapd[2049]: connection_read(17): no connection!
Followed, on each replication attempt, by:
Jul 22 09:40:14 janus slapd[2049]: send_search_entry: conn 33498 ber write failed.
Jul 22 09:40:14 janus slapd[2049]: conn=33498 fd=17 closed (connection lost on write)
Jul 22 09:40:16 janus slapd[2049]: send_search_entry: conn 33501 ber write failed.
Jul 22 09:40:16 janus slapd[2049]: conn=33501 fd=17 closed (connection lost on write)
Jul 22 09:44:34 janus slapd[2049]: send_search_entry: conn 33759 ber write failed.
Jul 22 09:44:34 janus slapd[2049]: conn=33759 fd=17 closed (connection lost on write)
And so on. This repeated many times, until the master server crashed and I restarted it:
Jul 22 10:06:50 janus slapd[2049]: send_ldap_response: ber write failed
Jul 22 10:38:15 janus slapd[29187]: hdb_db_open: unclean shutdown detected; attempting recovery.
Jul 22 10:38:17 janus slapd[29187]: slapd starting
While I was searching for what might have caused the crash, I got a few more:
Jul 22 10:41:36 janus slapd[29187]: send_search_entry: conn 71 ber write failed.
Jul 22 10:41:36 janus slapd[29187]: conn=71 fd=17 closed (connection lost on write)
On the slave, there were many more "no space" errors, until:
Jul 22 10:48:10 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 2072163: No space left on device
Jul 22 10:48:10 dyaus slapd2.3[2132]: => bdb_idl_delete_key: c_del id failed: No space left on device (28)
Jul 22 10:48:10 dyaus slapd2.3[2132]: Attribute index delete failure
Jul 22 10:48:10 dyaus slapd2.3[2132]: null_callback: error code 0x50
Jul 22 10:48:10 dyaus slapd2.3[2132]: syncrepl_entry: rid 001 be_modify failed (80)
Jul 22 10:48:10 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying (1 retries left)
Jul 22 10:48:11 dyaus slapd2.3[2132]: do_syncrep1: rid 001 ldap_sasl_bind_s failed (-1)
Jul 22 10:48:11 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying
Jul 22 10:48:12 dyaus slapd2.3[2132]: do_syncrep1: rid 001 ldap_sasl_bind_s failed (-1)
Yes, the master had crashed again, hence the bind failures. At this point I had discovered the disk problem and had freed up space on the slave.
The master was restarted. Here is the crash and restart:
Jul 22 10:48:09 janus slapd[29187]: send_ldap_response: ber write failed
Jul 22 10:48:09 janus slapd[29187]: ch_calloc of 1 elems of 46912585677634 bytes failed
Jul 22 10:50:40 janus slapd[29859]: hdb_db_open: unclean shutdown detected; attempting recovery.
Jul 22 10:50:40 janus slapd[29859]: slapd starting
Things seemed to be back to normal. My concern then was that a bad slave had caused the master to crash. I then noticed that Buchan Milne had just released a new version of his RPMs, so I upgraded the master. From the yum logs you can see the version transition:
Jan 11 09:45:02 Updated: openldap.x86_64 2.3.39-3.rhel5
Jul 22 11:04:15 Updated: openldap - 2.3.43-1.rhel5.x86_64
A few days later I started getting the error I reported originally on the master:
Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 DEL dn="uid=pcass,ou=People,dc=une,dc=edu,dc=au"
Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 RESULT tag=107 err=80 text=DN index delete failed
Whew, that error report took some time to put together. I hope I didn't miss anything vital.
In summary, I think I can say that with openldap.x86_64 2.3.39 I could crash the master syncrepl server by causing disk write errors on a slave. I don't know whether this problem still exists in 2.3.43.