Enabling LogLevel=sync gave me enough information to find out what fails:
Dec 15 10:44:27 forward01.unix slapd[9470]: [ID 349886 local4.debug] syncrepl_entry: rid 405 uid=desk,o=xxxx.xxx,ou=mail
Dec 15 10:44:27 forward01.unix slapd[9470]: [ID 190661 local4.debug] <= entry_decode: slap_str2undef_ad( 20091027142315Z#000001#00#000000): AttributeDescription contains inappropriate characters
I did a ldapsearch to dump that node, then ldapdelete and ldapadd -f ldif to restore it back in, in the hope that it would fix it. The entry on ldapmaster looked correct though.
The client spent considerably longer in sync, but eventually still died on the same entry (and very last entry in sync).
Turns out that the ldapmaster is correct, but all clients that replsync from master are all stuck on that entry:
# /usr/local/bin/ldapsearch -h forward01.unix -b uid=desk,o=xxxx.xxx,ou=mail
# search result search: 2 result: 80 Internal (implementation specific) error text: internal error
If I run slapindex on one of the clients I get:
<= entry_decode: slap_str2undef_ad( 20091027142315Z#000001#00#000000): AttributeDescription contains inappropriate characters bdb_tool_entry_reindex: could not locate id=289608
Similarly slapcat:
# slapcat -s uid=desk,o=xxxx.xxx,ou=mail
<= entry_decode: slap_str2undef_ad( 20091027142315Z#000001#00#000000): AttributeDescription contains inappropriate characters # no data for entry id=00046b48
So, for some reason all clients got this wrong record everywhere. Luckily, they are all slaves, so I can delete the entire database and let each one catch up. It is a shame I can't even just delete the bad record.
Lund
Jorgen Lundman wrote:
openldap-2.3.41 db-4.2.52.NC-PLUS_5_PATCHES SunOS ldapmaster01.unix 5.10 Generic_127128-11
Our LDAP setup has been running rather well for the last 2 years or so, but increasingly we have had to restart slapd more and more frequently. Occasional core-dumps and "very large" process footprints as well.
I suspect that it is perhaps stuck on replication, this message show up on the clients fairly frequently:
Dec 14 11:30:49 forward01.unix slapd[10494]: [ID 190661 local4.debug] <= entry_decode: slap_str2undef_ad( 20091027142315Z#000001#00#000000): AttributeDescription contains inappropriate characters
Dec 14 11:30:49 forward01.unix slapd[10494]: [ID 818565 local4.debug] null_callback: error code 0x50
Dec 14 11:30:49 forward01.unix slapd[10494]: [ID 776556 local4.debug] syncrepl_entry: rid 405 be_add failed (80)
Dec 14 11:30:49 forward01.unix slapd[10494]: [ID 747041 local4.debug] do_syncrepl: rid 405 retrying (9 retries left)
Which makes me think that it has been stuck since 20091027. Is there a way for me to find out which entry 20091027142315Z#000001#00#000000 refers to, so I can just delete it? (or fix it).
The pertinent lines in ldapmaster are:
lastmod on checkpoint 128 15 directory /usr/local/var/openldap-data
index objectClass eq index uid eq index uidNumber eq index mail eq index mailAlternateAddress pres,eq index deliveryMode eq index accountStatus eq index gecos eq index radiusGroupName eq index o pres,eq index entryCSN,entryUUID eq index gidNumber eq
overlay syncprov
syncprov-checkpoint 100 10 syncprov-sessionlog 100
And each slave has identical conf, except for the RID which is based on the last octet of the IP address:
lastmod on checkpoint 128 15 directory /usr/local/var/openldap-data
index objectClass eq index uid eq index uidNumber eq index mail eq index mailAlternateAddress pres,eq index deliveryMode eq index accountStatus eq index gecos eq index radiusGroupName eq index o pres,eq index entryCSN,entryUUID eq index gidNumber eq
syncrepl rid=405 provider=ldap://172.20.12.113 type=refreshAndPersist interval=00:00:00:30 searchbase="ou=mail,dc=gmo,dc=jp" filter="(objectClass=*)" attrs="*" scope=sub schemachecking=off updatedn="cn=admin,dc=gmo,dc=jp" bindmethod=simple binddn="cn=admin,dc=gmo,dc=jp" credentials="*censored*" retry="60 10 300 +"
updateref ldap://172.20.12.113
In general, do they seem reasonable? Everything has been appearing to be correct, except for the 20091027 entry. What is the recommended procedure? Reading the documentation makes me think I should also index "ContextCSN"?
A recent ldapmaster core was triggered here: #0 0xce7948da in t_delete () from /lib/libc.so.1 #1 0xce79417f in _malloc_unlocked () from /lib/libc.so.1 #2 0xce794058 in malloc () from /lib/libc.so.1 #3 0x08133175 in ber_memalloc_x () #4 0x08133487 in ber_dupbv_x () #5 0x0813352b in ber_dupbv () #6 0x0807c22e in attr_dup ()
... which I believe is replication-related due to a post made earlier in the list. Most likely also fixed in later versions.
Thanks for any reply,
Jorgen Lundman