OpenLDAP ITS:
OK, the issue is looking more and more like buggy slapd behavior.
I have now narrowed it down to an instance where slapd is partially "hung" and will neither stop nor restart. This correlates directly with the replication failures: replication breaks because slapd itself is breaking:
Oct 11 18:22:08 server1 slapd[10771]: <= bdb_equality_candidates: (uidNumber) not indexed
Oct 11 18:22:08 server1 slapd[10771]: <= bdb_equality_candidates: (gidNumber) not indexed
Oct 11 18:22:08 server1 slapd[10771]: <= bdb_equality_candidates: (uidNumber) not indexed
Oct 11 18:46:41 server1 slapd[10771]: <= bdb_equality_candidates: (uidNumber) not indexed
Oct 11 18:48:37 server1 slapd[10771]: <= bdb_equality_candidates: (uidNumber) not indexed
Oct 11 18:49:05 server1 slapd[10771]: daemon: shutdown requested and initiated.
Oct 11 18:49:05 server1 slapd[10771]: slapd shutdown: waiting for 0 operations/tasks to finish
As you can see, the logs show nothing interesting around the time this occurs. There is the repeated string of index warnings (not an issue; I just haven't indexed these attributes yet), followed by my attempt to restart slapd after I received a notification of a replication discrepancy.
I have grepped through all of my logs (dmesg, debug, syslog) for anything related to slapd. What you see above is the most interesting of the hits returned.
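One way to check whether slapd simply went quiet before the hang is to list the gaps between consecutive slapd log entries. Below is a minimal Python sketch, assuming the syslog line format shown in the excerpt above. The function names and the 10-minute threshold are illustrative assumptions, and the year is a placeholder because syslog timestamps omit it.

```python
import re
from datetime import datetime

# Matches the syslog prefix seen in the excerpt, e.g.
# "Oct 11 18:22:08 server1 slapd[10771]: <message>"
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ slapd\[(\d+)\]: (.*)$")


def parse_slapd_lines(lines, year=2000):
    """Return (timestamp, pid, message) tuples for lines logged by slapd.

    The year is a placeholder (assumption): syslog timestamps do not carry one.
    """
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
            entries.append((stamp, int(match.group(2)), match.group(3)))
    return entries


def silent_gaps(entries, min_seconds=600):
    """Return (start, end, seconds) for consecutive entries at least min_seconds apart."""
    gaps = []
    for (start, _, _), (end, _, _) in zip(entries, entries[1:]):
        delta = (end - start).total_seconds()
        if delta >= min_seconds:
            gaps.append((start, end, delta))
    return gaps


if __name__ == "__main__":
    sample = [
        "Oct 11 18:22:08 server1 slapd[10771]: <= bdb_equality_candidates: (uidNumber) not indexed",
        "Oct 11 18:46:41 server1 slapd[10771]: <= bdb_equality_candidates: (uidNumber) not indexed",
        "Oct 11 18:48:37 server1 slapd[10771]: <= bdb_equality_candidates: (uidNumber) not indexed",
    ]
    for start, end, seconds in silent_gaps(parse_slapd_lines(sample)):
        print(f"no slapd output from {start:%H:%M:%S} to {end:%H:%M:%S} ({seconds:.0f} s)")
```

Run against the excerpt above, this would flag the stretch between 18:22:08 and 18:46:41, the only period longer than ten minutes with no slapd output. A quiet period is not proof of a hang, since an idle server logs nothing either, so this only narrows down where to look.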
PLEASE help. The issue is getting more serious now and, judging by the evidence I've presented, it looks more and more out of my control.
You've seen my config - can anyone think of why this would happen? It seems vaguely like a locking issue ....
Thanks
Jeff