Hi
We run a 4-way MMR setup based on OpenLDAP 2.4.44 with 4 M entries, which is fairly write intensive (Zimbra).
Lately we've seen very frequent lockups of the master that receives the updates (only 1 out of the 4 takes writes), whereas the replicas stay responsive. According to the -d stats logs, all threads suddenly take a long time to answer any query, and slapd can no longer accept new connections. The issue always disappears again without intervention, but it usually hits several times in a row, on an almost daily basis.
We tested a lot of things, but eventually "solved" the issue by dumping and reloading the database with slapcat and slapadd; the master server has been completely stable since. The reload also reduced the size of the mdb by 50%.
Looking at the old mdb (prior to the dump), mdb_stat -f shows it had over 3.7 M free pages. Could it be an issue of database fragmentation similar to ITS#8664?
Is it natural that the freelist (and thus the mdb) gets this big over time? I would expect those free pages to get reused constantly. If it is, would it make sense to monitor the number of free pages? Is there a threshold to watch for before things get problematic again? (ITS#7770 would come in handy here, as we already monitor/graph various metrics from the monitor backend.)
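For what it's worth, a rough sketch of how the free page count could be graphed in the meantime, by scraping mdb_stat -ef. It assumes the output consists of "Name: number" lines and includes "Free pages" (freelist section) and "Number of pages used" (environment section); the function names and the database path are made up for illustration, and no threshold is implied.

```python
import re
import subprocess

# Matches lines such as "  Free pages: 3700000" (assumed mdb_stat output format).
_FIELD = re.compile(r"^\s*([^:]+):\s*(\d+)\s*$")


def parse_mdb_stat(text):
    """Collect 'Name: integer' lines from mdb_stat output.

    Names like 'Entries' repeat per section, so the first occurrence wins;
    'Free pages' and 'Number of pages used' are expected to appear only once.
    """
    fields = {}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            fields.setdefault(match.group(1).strip(), int(match.group(2)))
    return fields


def free_page_ratio(fields):
    """Free pages divided by pages used, or 0.0 if no pages are in use."""
    used = fields.get("Number of pages used", 0)
    free = fields.get("Free pages", 0)
    return free / used if used else 0.0


def read_mdb_stat(db_dir):
    """Run mdb_stat -ef on a database directory (hypothetical path) and parse it."""
    result = subprocess.run(
        ["mdb_stat", "-ef", db_dir],
        capture_output=True, text=True, check=True,
    )
    return parse_mdb_stat(result.stdout)


if __name__ == "__main__":
    # Hypothetical database location; adjust to the actual mdb directory.
    stats = read_mdb_stat("/var/lib/ldap")
    print("free pages:", stats.get("Free pages"))
    print("free/used ratio: %.2f" % free_page_ratio(stats))
```

The two values could be fed into the existing graphs next to the monitor backend metrics, so that any growth of the freelist at least becomes visible over time.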
Geert
openldap-technical@openldap.org