slapd on our master server crashed unexpectedly this morning and left nothing behind in the logs. I'm looking for some hints on tracking down the cause.
Here's the background.
We are running OpenLDAP 2.3.41 (a locally built RPM) on Red Hat Enterprise Linux 4 with four slave servers (also running 2.3.41 on RHEL4). We use a nightly update process: we slapcat the master database, apply the changes from the systems of record (students, employees, retirees, etc.) to the LDIF, generate an ldapmodify input stream, and run ldapmodify to apply the changes.
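For readers unfamiliar with this kind of nightly sync, here is a rough, hypothetical sketch of the "generate an ldapmodify input stream" step. It assumes the slapcat output and the updated LDIF have already been parsed into dictionaries mapping DN to {attribute: [values]}; the function names are made up for illustration, and this is not our actual tooling. It writes values verbatim (no base64 encoding for non-ASCII data) and does not detect modrdns or order deletes of child entries before their parents.

```python
def _attr_lines(attr, values):
    """One 'attr: value' LDIF line per value (values written verbatim)."""
    return [f"{attr}: {v}" for v in values]


def ldif_changes(old, new):
    """Return LDIF change records that turn `old` into `new`.

    old, new: dict mapping DN -> dict mapping attribute name -> list of values.
    Hypothetical sketch: no modrdn detection, no base64, no DN ordering.
    """
    records = []

    # Entries that disappeared from the systems of record.
    for dn in sorted(old.keys() - new.keys()):
        records.append([f"dn: {dn}", "changetype: delete"])

    # Brand-new entries.
    for dn in sorted(new.keys() - old.keys()):
        rec = [f"dn: {dn}", "changetype: add"]
        for attr, values in new[dn].items():
            rec += _attr_lines(attr, values)
        records.append(rec)

    # Entries present in both: replace changed attributes, delete dropped ones.
    for dn in sorted(old.keys() & new.keys()):
        before, after = old[dn], new[dn]
        mods = []
        for attr in sorted(before.keys() | after.keys()):
            if attr not in after:
                mods += [f"delete: {attr}", "-"]
            elif sorted(before.get(attr, [])) != sorted(after[attr]):
                mods += [f"replace: {attr}"] + _attr_lines(attr, after[attr]) + ["-"]
        if mods:
            records.append([f"dn: {dn}", "changetype: modify"] + mods)

    # LDIF change records are separated by blank lines.
    return "\n\n".join("\n".join(rec) for rec in records) + ("\n" if records else "")
```

The output can be fed to `ldapmodify -f`. A night like the one described below simply produces a very large stream from this kind of diff.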
The student system made some massive changes this morning, which produced an ldapmodify input file with 31,973 changes (adds, modifies, and modrdns). The ldapmodify run on the master took 8 minutes, but delta-syncrepl took 33 to 44 minutes to bring the slave/replica servers up to date. The replicas' delta-syncrepl processes appear to have averaged about 800 changes per minute, which is much slower than I expected.
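To make the comparison concrete, the figures above work out as follows. This is plain arithmetic on the numbers already given (31,973 changes, 8 minutes on the master, 33 to 44 minutes on the replicas); nothing here is measured separately.

```python
CHANGES = 31_973            # adds, modifies and modrdns in the ldapmodify input
MASTER_MINUTES = 8          # ldapmodify wall-clock time on the master
REPLICA_MINUTES = (33, 44)  # fastest and slowest replica to catch up


def rate_per_minute(changes, minutes):
    """Average number of changes applied per minute."""
    return changes / minutes


master_rate = rate_per_minute(CHANGES, MASTER_MINUTES)
replica_rates = [rate_per_minute(CHANGES, m) for m in REPLICA_MINUTES]
# How many times longer each replica took than the master.
slowdown = [m / MASTER_MINUTES for m in REPLICA_MINUTES]

print(f"master: {master_rate:.0f} changes/min")
for m, r, s in zip(REPLICA_MINUTES, replica_rates, slowdown):
    print(f"replica ({m} min): {r:.0f} changes/min, {s:.1f}x the master's time")
```

That is roughly 4,000 changes per minute on the master against roughly 730 to 970 per minute on the replicas, so the replicas took between about 4 and 5.5 times as long.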
Because it took so long for the replicas to receive all the changes, they fell more than 10 minutes behind the master, and the person on call got paged (Nagios monitors the contextCSN values on the master and the replicas). The person on call had not been properly trained (my fault) to look for the syncrepl messages in syslog on the replica servers, so, thinking that something was hung, they restarted one of the replicas. The replica restarted properly, but the master seems to have crashed without a sound at the same time. No core file was generated, and I haven't found anything logged in syslog on the master. When slapd was started again on the master, its startup output reported that the accesslog database had been shut down uncleanly and needed to be recovered, which completed successfully.
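The paging rule described above could be sketched roughly like this. It is a hypothetical illustration, not our actual Nagios plugin: it assumes the contextCSN values have already been fetched from the master and a replica (for example with a base-scope ldapsearch on the suffix) and that each CSN starts with a generalized-time stamp ending in `Z`, such as `20080305123456Z#000000#00#000000`. The function names and the 10-minute threshold constant are made up for illustration.

```python
from datetime import datetime, timezone

MAX_LAG_SECONDS = 10 * 60  # hypothetical: mirrors the 10-minute paging threshold


def csn_time(csn):
    """UTC timestamp at the start of a CSN (assumed 'YYYYmmddHHMMSS[.ffffff]Z#...')."""
    stamp = csn.split("#", 1)[0].rstrip("Z")
    stamp = stamp.split(".", 1)[0]  # drop fractional seconds if present
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def replica_lag_seconds(master_csn, replica_csn):
    """How many seconds the replica's contextCSN trails the master's."""
    return (csn_time(master_csn) - csn_time(replica_csn)).total_seconds()


def check_replica(master_csn, replica_csn):
    """Nagios-style status line based only on CSN lag."""
    lag = replica_lag_seconds(master_csn, replica_csn)
    if lag > MAX_LAG_SECONDS:
        return f"CRITICAL: replica {lag:.0f}s behind master"
    return f"OK: replica {lag:.0f}s behind master"


def is_advancing(previous_csn, current_csn):
    """True if the replica's CSN moved forward between two polls.

    Assumes both values use the same fixed-width CSN format, so plain
    string comparison orders them correctly.
    """
    return current_csn > previous_csn
```

Lag alone cannot tell a busy replica from a hung one. Pairing the lag threshold with `is_advancing` across two polls would let an alert distinguish a replica that is steadily working through a large batch from one that has actually stopped.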
I'm wondering the following things:
1) Is it possible that one of the syncrepl ITSs to be fixed in 2.3.42 would address this crash? Any suggestions for tracking down why it crashed?
2) Does this look like a configuration problem, given that delta-syncrepl took about five times as long to get the changes out to the replicas as it took to apply them on the master? If so, where would you suggest I look?
Thanks,