If I were facing this symptom, I'd capture a couple of pstack <slapd PID> outputs while the problem is occurring (and maybe correlate them with perf top -p <slapd PID> if the pstacks are not enough). That should help avoid guesswork.
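A minimal sketch of such a capture loop, in case it helps. It assumes pstack is installed (on many distributions it ships with gdb) and that you run it as root or as the slapd user. The function name capture_stacks and the STACK_CMD override are made up for this sketch and are not part of any tool.

```shell
# capture_stacks PID COUNT INTERVAL OUTDIR
# Writes COUNT stack dumps of process PID into OUTDIR, INTERVAL seconds apart.
# STACK_CMD is a hypothetical override for this sketch; it defaults to pstack.
capture_stacks() {
    pid=$1
    count=$2
    interval=$3
    outdir=$4
    mkdir -p "$outdir" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        {
            date                           # timestamp, to correlate with top or HAProxy logs
            "${STACK_CMD:-pstack}" "$pid"  # one stack trace per thread
        } > "$outdir/pstack.$i.txt" 2>&1
        i=$((i + 1))
        if [ "$i" -le "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

During a stall it could be called as: capture_stacks <slapd PID> 3 2 /tmp/slapd-stacks. If the dumps come from a gdb-based pstack, each thread is listed with its LWP number, which should match the thread ID that top -H -p <slapd PID> shows for the thread running at 100%.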
++Cyrille
-----Original Message-----
From: Simone Piccardi [mailto:piccardi@truelite.it]
Sent: Tuesday, November 3, 2020 6:41 PM
To: openldap-technical@openldap.org
Subject: Connections blocked for some tens of seconds while a single slapd thread running 100%
Hi,
we are seeing quite strange behaviour in which a slapd server stops processing connections for some tens of seconds while a single thread runs at 100% on a single CPU and all other CPUs are almost idle. When the problem arises there is no significant iowait or disk I/O (and no swapping, which is disabled). Context switches drop to near zero (from some tens of thousands to some hundreds). Load average is almost always under 2.
The server has 32G of RAM and 4 HT processors, and is running openldap-2.4.54 in mirror mode (but without delta replication) using the mdb backend. The same behaviour was also seen with 2.4.53. OpenLDAP is the only service running on it, apart from SSH and some monitoring tools. The database maxsize is 25G, of which around 17G are used.
I'm attaching a redacted configuration of the main server (the secondary one is the same, with the IDs swapped for mirror mode use).
Most of the time it works just fine, processing up to a few thousand read queries per second along with some tens of writes per second. Connections are managed by HAProxy, which sends them to this server by default (it is used as the main node). Often these stalls are short (around 10 seconds) and we don't lose connections, but when the problem arises and lasts long enough, HAProxy switches to the second node and we get downtime. When staying on the secondary node we see the same behaviour.
The problem shows no periodicity, and looking at the number of connections just before it we could not see any usage peak. We tried to strace the slapd threads during the problem, and they seem to be blocked on a mutex, waiting for the one running at 100% (on a single CPU, in user time). I'm attaching top output from one of these events.
From the behaviour I was suspecting (just a wild and uninformed guess) some indexing issue blocking all access.
We tried changing tool-threads to 4 because I found it cited in some examples as related to the threads used for indexing, but the change had no effect. Re-reading the latest version of the man page, if I understand it correctly, it is effective only for slapadd etc.
So a first question is: is there any other configuration parameter related to indexing that I can try?
Anyway, I'm not sure there actually is an indexing issue (the indexes are quite basic). I suspected this because there are a lot of writes, and there is no strace activity during the stall. Should I look somewhere else?
Any suggestions on further checks or configuration changes would be more than appreciated.
Regards
Simone