I am running Buchan Milne's build of openldap-2.3.43-1.rhel5.x86_64 on a RedHat 5.2 system running kernel 2.6.18-92.1.6.el5 in a VMware instance.
This system runs as a master to 3 other syncrepl slaves. Very few queries are made against this system; it is simply a central place to direct updates to. Writes are infrequent, averaging 4-5 an hour. Databases are hdb.
This system was running openldap-servers-2.3.39-3.rhel5.x86_64 for several months until a few days ago, when I did an update via yum pointing to http://staff.telkomsa.net/packages/rhel5/openldap/x86_64/
This system has been running without issues until last night. Then this appeared in the logs during deletion of a user account:
Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 DEL dn="uid=pcass,ou=People,dc=une,dc=edu,dc=au"
Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 RESULT tag=107 err=80 text=DN index delete failed
Other accounts have been deleted before and after this event. Further attempts to delete this particular entry always fail with the same error.
In an attempt to fix this, I shut down slapd, ran /usr/bin/slapd_db_recover in the /var/lib/ldap directory, and restarted slapd.
Attempting to delete the entry after this produced:
Jul 25 14:20:14 janus slapd[15043]: conn=3 op=1 DEL dn="uid=pcass,ou=People,dc=une,dc=edu,dc=au"
Jul 25 14:20:14 janus slapd[15043]: conn=3 op=1 RESULT tag=107 err=32 text=
Further attempts return us to:
Jul 25 14:24:54 janus slapd[15043]: conn=19 op=1 DEL dn="uid=pcass,ou=People,dc=une,dc=edu,dc=au"
Jul 25 14:24:54 janus slapd[15043]: conn=19 op=1 RESULT tag=107 err=80 text=DN index delete failed
Searching the list, I see similar problems mentioned in these threads:
http://www.openldap.org/lists/openldap-software/200801/msg00257.html
http://www.openldap.org/lists/openldap-software/200802/msg00045.html
There did not seem to be any resolution to these however.
Norman Gaywood wrote:
> I am running Buchan Milne's build of openldap-2.3.43-1.rhel5.x86_64 on a RedHat 5.2 system running kernel 2.6.18-92.1.6.el5 in a VMware instance.
> This system runs as a master to 3 other syncrepl slaves. Very few queries are made against this system; it is simply a central place to direct updates to. Writes are infrequent, averaging 4-5 an hour. Databases are hdb.
> This system was running openldap-servers-2.3.39-3.rhel5.x86_64 for several months until a few days ago, when I did an update via yum pointing to http://staff.telkomsa.net/packages/rhel5/openldap/x86_64/
> This system has been running without issues until last night. Then this appeared in the logs during deletion of a user account:
> Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 DEL dn="uid=pcass,ou=People,dc=une,dc=edu,dc=au"
> Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 RESULT tag=107 err=80 text=DN index delete failed
> Other accounts have been deleted before and after this event. Further attempts to delete this particular entry always fail with the same error.
> In an attempt to fix this, I shut down slapd, ran /usr/bin/slapd_db_recover in the /var/lib/ldap directory, and restarted slapd.
> Attempting to delete the entry after this produced:
> Jul 25 14:20:14 janus slapd[15043]: conn=3 op=1 DEL dn="uid=pcass,ou=People,dc=une,dc=edu,dc=au"
> Jul 25 14:20:14 janus slapd[15043]: conn=3 op=1 RESULT tag=107 err=32 text=
Result code 32 indicates that the object does not exist in the DIT.
Have you added any new indexes previous to this? How did you upgrade your RPMs?
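For reference, the two result codes seen in these logs map to the standard LDAP resultCode values from RFC 4511; a minimal sketch of the mapping:

```python
# LDAP result codes appearing in this thread, per the RFC 4511
# resultCode enumeration.
LDAP_RESULT_CODES = {
    0: "success",
    32: "noSuchObject",  # err=32: the entry is not present in the DIT
    80: "other",         # err=80: server-side failure, e.g. a backend/index error
}

def describe(err):
    """Return a short name for an LDAP result code."""
    return LDAP_RESULT_CODES.get(err, "unknown({})".format(err))

print(describe(32))  # noSuchObject
print(describe(80))  # other
```

So the sequence above means the first delete failed inside the backend (err=80), and after db_recover the entry appeared to be gone (err=32), even though later attempts hit the index failure again.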
> Further attempts return us to:
> Jul 25 14:24:54 janus slapd[15043]: conn=19 op=1 DEL dn="uid=pcass,ou=People,dc=une,dc=edu,dc=au"
> Jul 25 14:24:54 janus slapd[15043]: conn=19 op=1 RESULT tag=107 err=80 text=DN index delete failed
> Searching the list, I see similar problems mentioned in these threads:
> http://www.openldap.org/lists/openldap-software/200801/msg00257.html
> http://www.openldap.org/lists/openldap-software/200802/msg00045.html
> There did not seem to be any resolution to these however.
> RedHat 5.2 system running kernel 2.6.18-92.1.6.el5 in a VMware instance.
> This system runs as a master to 3 other syncrepl slaves. Very few queries are made against this system; it is simply a central place to direct updates to. Writes are infrequent, averaging 4-5 an hour. Databases are hdb.
> This system was running openldap-servers-2.3.39-3.rhel5.x86_64 for several months until a few days ago, when I did an update via yum pointing to http://staff.telkomsa.net/packages/rhel5/openldap/x86_64/
> This system has been running without issues until last night. Then this appeared in the logs during deletion of a user account:
> Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 DEL dn="uid=pcass,ou=People,dc=une,dc=edu,dc=au"
> Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 RESULT tag=107 err=80 text=DN index delete failed
Few questions:
- are you using bdb/hdb or ldbm?
- did you run slapindex as root while your slapd runs as a regular user?
You should be able to recover by slapcat'ting the database (the entry should no longer be there) and slapadd'ing the resulting file. Make sure permissions and ownership are correct before restarting slapd. And upgrade: OpenLDAP 2.3.43 is available (and 2.4.11).
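The slapcat/slapadd cycle described above might look like the following sketch. The paths, the `ldap` user/group, and the `ldap` service name are assumptions for a typical RHEL5 package layout, not part of the original advice; adjust them for your site, and note that this touches live database files, so it cannot be dry-run here:

```shell
#!/bin/sh
# Sketch: rebuild a bdb/hdb database from an LDIF export.
# Assumes the database lives in /var/lib/ldap and slapd runs as ldap:ldap
# (typical for RHEL5 packages) -- adjust for your installation.

service ldap stop                      # stop slapd before touching the files

slapcat -l /root/backup.ldif           # export the database to LDIF

# Move the old database files aside rather than deleting them outright,
# so you can roll back if the reload fails.
mkdir -p /var/lib/ldap.old
mv /var/lib/ldap/*.bdb /var/lib/ldap/__db.* /var/lib/ldap/log.* \
   /var/lib/ldap.old/ 2>/dev/null

slapadd -l /root/backup.ldif           # reload, rebuilding all indexes

# slapadd run as root leaves root-owned files; fix ownership before starting,
# or slapd will fail to open its database.
chown -R ldap:ldap /var/lib/ldap

service ldap start
```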
p.
Ing. Pierangelo Masarati
OpenLDAP Core Team
SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
Office: +39 02 23998309   Mobile: +39 333 4963172   Fax: +39 0382 476497
Email: ando@sys-net.it
On Fri, 25 Jul 2008, Norman Gaywood wrote:
> Jul 25 14:24:54 janus slapd[15043]: conn=19 op=1 RESULT tag=107 err=80 text=DN index delete failed
> [...]
> http://www.openldap.org/lists/openldap-software/200801/msg00257.html
That all happened around the same time I hit some extremely odd corruption, ITS#5342. The eventual consensus there, as you can read, chalked it up to some pre-existing issue (be it Sleepycat's or OpenLDAP's, or maybe even a lost write on my ATA drive), so there was no real action apart from blowing away the database and slapadd'ing it with the latest version, and no followup was necessary in terms of code changes.
If you can figure this out in a reproducible fashion with current releases, that would be interesting. I left off confident that it wouldn't be possible.
Thanks for the responses. To sum up, after a few more "DN index delete failed" messages on my LDAP master server, I slapcat'ed, deleted and slapadd'ed my databases and everything seems fine again.
The responses I received were all generally trying to figure out what I might have done to cause the problem, be it errors in the RPM upgrade, running database maintenance programs as the wrong user, or some other event.
In fact, there was an event a few days before the errors that may have caused the problem. On one of my slaves, a full disk caused replication to fail. The slave's logs began like this:
Jul 22 08:10:01 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 6878: No space left on device
Jul 22 08:10:01 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 6878: No space left on device
Jul 22 08:10:01 dyaus slapd2.3[2132]: null_callback: error code 0x50
Jul 22 08:10:01 dyaus slapd2.3[2132]: syncrepl_entry: rid 001 be_modify failed (80)
Jul 22 08:10:01 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying (1 retries left)
Jul 22 08:10:02 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 7034: No space left on device
Jul 22 08:10:02 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 7034: No space left on device
Jul 22 08:10:02 dyaus slapd2.3[2132]: null_callback: error code 0x50
Jul 22 08:10:02 dyaus slapd2.3[2132]: syncrepl_updateCookie: rid 001 be_modify failed (80)
Jul 22 08:30:35 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 30286: No space left on device
Jul 22 08:30:35 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 30286: No space left on device
Jul 22 08:30:35 dyaus slapd2.3[2132]: null_callback: error code 0x50
Jul 22 08:30:35 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying (1 retries left)
On the master at the corresponding time, I have:
Jul 22 08:10:01 janus slapd[2049]: conn=31060 fd=17 ACCEPT from IP=129.180.1.107:33547 (IP=0.0.0.0:389)
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=0 BIND dn="cn=syncuser,dc=une,dc=edu,dc=au" method=128
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=0 BIND dn="cn=SyncUser,dc=une,dc=edu,dc=au" mech=SIMPLE ssf=0
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=0 RESULT tag=97 err=0 text=
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=1 SRCH base="dc=une,dc=edu,dc=au" scope=2 deref=0 filter="(objectClass=*)"
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=1 SRCH attr=* structuralObjectClass entryCSN
A delay on this connection:
Jul 22 08:30:35 janus slapd[2049]: conn=31060 op=2 UNBIND
Jul 22 08:30:35 janus slapd[2049]: conn=31060 fd=17 closed
Then on the master, the logs started to look like this. Note that there were no disk problems on the master.
Jul 22 09:39:40 janus slapd[2049]: send_search_entry: conn 33464 ber write failed.
Jul 22 09:39:40 janus slapd[2049]: conn=33464 fd=17 closed (connection lost on write)
Jul 22 09:39:40 janus slapd[2049]: connection_read(17): no connection!
Followed, on each replication attempt, by:
Jul 22 09:40:14 janus slapd[2049]: send_search_entry: conn 33498 ber write failed.
Jul 22 09:40:14 janus slapd[2049]: conn=33498 fd=17 closed (connection lost on write)
Jul 22 09:40:16 janus slapd[2049]: send_search_entry: conn 33501 ber write failed.
Jul 22 09:40:16 janus slapd[2049]: conn=33501 fd=17 closed (connection lost on write)
Jul 22 09:44:34 janus slapd[2049]: send_search_entry: conn 33759 ber write failed.
Jul 22 09:44:34 janus slapd[2049]: conn=33759 fd=17 closed (connection lost on write)
etc. Many repeats of this, until the master server crashed and I restarted it:
Jul 22 10:06:50 janus slapd[2049]: send_ldap_response: ber write failed
Jul 22 10:38:15 janus slapd[29187]: hdb_db_open: unclean shutdown detected; attempting recovery.
Jul 22 10:38:17 janus slapd[29187]: slapd starting
While I was searching for what might have caused the crash, I got a few more:
Jul 22 10:41:36 janus slapd[29187]: send_search_entry: conn 71 ber write failed.
Jul 22 10:41:36 janus slapd[29187]: conn=71 fd=17 closed (connection lost on write)
On the slave, there were many more no-space errors, until:
Jul 22 10:48:10 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 2072163: No space left on device
Jul 22 10:48:10 dyaus slapd2.3[2132]: => bdb_idl_delete_key: c_del id failed: No space left on device (28)
Jul 22 10:48:10 dyaus slapd2.3[2132]: Attribute index delete failure
Jul 22 10:48:10 dyaus slapd2.3[2132]: null_callback: error code 0x50
Jul 22 10:48:10 dyaus slapd2.3[2132]: syncrepl_entry: rid 001 be_modify failed (80)
Jul 22 10:48:10 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying (1 retries left)
Jul 22 10:48:11 dyaus slapd2.3[2132]: do_syncrep1: rid 001 ldap_sasl_bind_s failed (-1)
Jul 22 10:48:11 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying
Jul 22 10:48:12 dyaus slapd2.3[2132]: do_syncrep1: rid 001 ldap_sasl_bind_s failed (-1)
Yes, the master had crashed again, hence the bind failures. At this point I had discovered the disk problem and had made space available on the slave.
The master was restarted. Here is the crash and restart:
Jul 22 10:48:09 janus slapd[29187]: send_ldap_response: ber write failed
Jul 22 10:48:09 janus slapd[29187]: ch_calloc of 1 elems of 46912585677634 bytes failed
Jul 22 10:50:40 janus slapd[29859]: hdb_db_open: unclean shutdown detected; attempting recovery.
Jul 22 10:50:40 janus slapd[29859]: slapd starting
Things seemed to be back to normal. My concern then was that a bad slave had caused the master to crash. I then noticed that Buchan Milne had just released a new version of his RPMs, so I upgraded the master. From the yum logs you can see the version transition:
Jan 11 09:45:02 Updated: openldap.x86_64 2.3.39-3.rhel5
Jul 22 11:04:15 Updated: openldap - 2.3.43-1.rhel5.x86_64
A few days later I started getting the error I reported originally on the master:
Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 DEL dn="uid=pcass,ou=People,dc=une,dc=edu,dc=au"
Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 RESULT tag=107 err=80 text=DN index delete failed
Whew, that took some time to put that error report together. I hope I didn't miss anything vital.
In summary, I think I can say that with openldap.x86_64 2.3.39 I could crash the master syncrepl server by causing disk write errors on a slave. I don't know whether this problem still exists in 2.3.43.