Hi,
we are seeing quite strange behaviour in which a slapd server stops
processing connections for some tens of seconds while a single thread
runs at 100% on a single CPU and all other CPUs are almost idle.
When the problem arises there is no significant iowait or disk I/O (and
no swapping, which is disabled). Context switches drop to near zero (from
some tens of thousands to some hundreds). Load average is almost always
under 2.
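
For anyone who wants to watch the same symptom, here is a minimal sketch
of how the system-wide context switch rate can be sampled on Linux from
the "ctxt" counter in /proc/stat. The function names are hypothetical
and this is not the monitoring tool we actually use; it only illustrates
the kind of drop described above.

```python
import time


def parse_ctxt(stat_text):
    """Return the cumulative context switch count from /proc/stat text."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def ctxt_rate(first, second, interval):
    """Context switches per second between two cumulative samples."""
    return (second - first) / interval


def watch(interval=1.0, samples=10):
    """Print the context switch rate once per interval (Linux only)."""
    with open("/proc/stat") as fh:
        previous = parse_ctxt(fh.read())
    for _ in range(samples):
        time.sleep(interval)
        with open("/proc/stat") as fh:
            current = parse_ctxt(fh.read())
        print("%.0f ctxt/s" % ctxt_rate(previous, current, interval))
        previous = current
```

Calling watch() during normal load and again during a stall should make
the difference visible, assuming the stall lasts longer than the
sampling interval.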
The server has 32G of RAM and 4 HT processors, and is running
openldap-2.4.54 in mirror mode (but no delta replication) using the mdb
backend. The same behaviour was also seen with 2.4.53. OpenLDAP is the
only service running on it, apart from SSH and some monitoring tools.
Database maxsize is 25G, of which around 17G are used.
I'm attaching a redacted configuration of the main server (the secondary
one is the same, with the IDs swapped for mirror mode use).
Most of the time it works just fine, processing up to a few thousand
read queries per second along with some tens of writes per second.
Connections are managed by HAProxy, which sends them to this server by
default (it is used as the main node). Many times these stalls are short
(around 10 seconds) and we don't lose connections, but when the problem
arises and lasts long enough, HAProxy switches to the second node and we
get downtime. Staying on the secondary node, we see the same behaviour.
The problem shows no periodicity, and looking at the number of
connections before it occurs we could not see any usage peak. We tried
to strace the slapd threads during the problem, and they seem to be
blocked on a mutex, waiting for the one running at 100% (on a single
CPU, user time).
I'm attaching the top output from one of these events.
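
To make the "one thread at 100% in user time" observation reproducible,
here is a minimal sketch that samples per-thread CPU time from
/proc/<pid>/task/<tid>/stat on Linux and ranks the threads. It is an
illustration under stated assumptions, not the procedure we followed:
the function names are hypothetical, and the field positions (utime is
field 14, stime is field 15) are taken from the proc(5) man page.

```python
import os
import time


def parse_thread_stat(line):
    """Return (comm, utime, stime) in clock ticks from one stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST closing parenthesis.
    head, _, tail = line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # fields[0] is field 3 (state), so utime/stime are at index 11/12.
    return comm, int(fields[11]), int(fields[12])


def read_thread_ticks(pid):
    """Map each thread ID of pid to its (utime, stime) in clock ticks."""
    ticks = {}
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as fh:
                _, utime, stime = parse_thread_stat(fh.read())
        except OSError:
            continue  # thread exited between listdir() and open()
        ticks[int(tid)] = (utime, stime)
    return ticks


def busiest_threads(before, after, interval, hz):
    """Return [(tid, user_pct, sys_pct)], highest total CPU first."""
    usage = []
    for tid, (u1, s1) in after.items():
        # Threads that appeared during the interval count as zero usage.
        u0, s0 = before.get(tid, (u1, s1))
        user_pct = 100.0 * (u1 - u0) / hz / interval
        sys_pct = 100.0 * (s1 - s0) / hz / interval
        usage.append((tid, user_pct, sys_pct))
    usage.sort(key=lambda row: row[1] + row[2], reverse=True)
    return usage


def sample(pid, interval=1.0):
    """Sample pid's threads twice and rank them by CPU use in between."""
    hz = os.sysconf("SC_CLK_TCK")
    before = read_thread_ticks(pid)
    time.sleep(interval)
    after = read_thread_ticks(pid)
    return busiest_threads(before, after, interval, hz)
```

Running sample() with the slapd PID during a stall should list the
spinning thread first, with nearly all of its time in the user column.
That thread ID is the one to pass to strace -p or to a debugger.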
From the behaviour I suspected (just a wild and uninformed guess)
some indexing issue blocking all access.
We tried changing tool-threads to 4 because I found it cited in some
examples as related to the threads used for indexing, but the change had
no effect. Re-reading the latest version of the man page, if I
understand it correctly, it only takes effect for slapadd etc.
So a first question is: is there any other configuration parameter
related to indexing that I can try?
Anyway, I'm not sure there is actually an indexing issue (the indexes
are quite basic). I suspected this because there are a lot of writes,
and strace shows no activity during the stall. Should I look somewhere
else?
Any suggestion on further checks or configuration changes will be more
than appreciated.
Regards
Simone