We run a 4-way MMR setup based on OpenLDAP 2.4.44, holding 4 M entries,
with a fairly write-intensive workload (Zimbra).
Lately we've seen very frequent lockups of the master that receives
the updates (only 1 out of 4), whereas the replicas stay responsive.
According to -d stats logs, all threads suddenly take a long time to
answer any queries, and slapd can no longer accept new connections.
The issue always disappears again without intervention, but usually
hits a number of times in a row, on an almost daily basis.
We tested a lot of things, but eventually "solved" the issue with a
slapcat and slapadd of the database - the master server has been
completely stable again since. The mdb was also reduced 50% in size.
Looking at the old mdb (prior to the dump), mdb_stat -f shows it had
over 3.7 M free pages. Could it be an issue of database fragmentation
similar to ITS#8664?
Is it natural that the freelist (and thus the mdb) gets this big over
time? I would expect those free pages to get reused constantly.
And in that case would it make sense to monitor the number of free
pages? Is there a threshold to look for, before things get problematic
again? (ITS#7770 would come in handy here, as we already monitor/graph
various metrics from the monitor backend.)
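
Until something like ITS#7770 is available, a check along these lines could
feed the free-page count into our graphs. It is only a minimal sketch: it
assumes that mdb_stat -ef prints a "Number of pages used" line in the
environment section and a "Free pages" line in the freelist section, and the
function names and the database path are hypothetical. It deliberately
contains no alert threshold, since which value is problematic is exactly the
open question.

```python
import re
import subprocess


def parse_mdb_stat(text):
    """Extract page counters from `mdb_stat -ef` output.

    The two labels below are assumed to match the tool's output format.
    """
    def grab(label):
        pattern = r"^\s*" + re.escape(label) + r":\s*(\d+)\s*$"
        match = re.search(pattern, text, re.MULTILINE)
        return int(match.group(1)) if match else None

    return {
        "used_pages": grab("Number of pages used"),
        "free_pages": grab("Free pages"),
    }


def free_ratio(stats):
    """Share of allocated pages that currently sit on the freelist."""
    used, free = stats["used_pages"], stats["free_pages"]
    if not used or free is None:
        return None
    return free / used


def collect(db_dir):
    """Run mdb_stat against a database directory and return the counters."""
    result = subprocess.run(
        ["mdb_stat", "-ef", db_dir],
        capture_output=True, text=True, check=True,
    )
    stats = parse_mdb_stat(result.stdout)
    stats["free_ratio"] = free_ratio(stats)
    return stats
```

Calling collect() with the mdb directory (for example a hypothetical
/opt/zimbra/data/ldap/mdb/db) from a cron job or monitoring agent would give
one data point per run to graph next to the monitor backend metrics.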
geert.hendrickx.be :: geert(a)hendrickx.be :: PGP: 0xC4BB9E9F
This e-mail was composed using 100% recycled spam messages!