What drives CPU usage spikes? - openldap-technical

23 Jun 2023


      In our 2.6.4 deployment, we had a significant spike in CPU usage one day 
last week that lasted approximately 2 hours (8 AM UTC to 10 AM UTC). 
During this time, some clients started timing out when talking to the LDAP 
service, and search response times spiked as well, up to 9.5 seconds on 
searches that normally take < 3 seconds (they do have large result sets). 
This happened on all 6 of the read nodes in our load-balanced 
pool, so whatever the issue was, it hit all of them at the same time.  It did 
not happen to 2 specialized read nodes that only serve one specific 
service, so it was something about the traffic going to those 6 nodes.  The 
number of ops/second during that time frame was actually lower than usual 
across the cluster, with a peak of 200 ops/second.  We often have higher 
peaks than that without this type of CPU usage spiking.
I'm curious what, with modern slapd + LMDB, I should look for that could 
drive such a spike.  I thought perhaps there were a significant number of 
write operations at the same time, but this was not the case: there was no 
unusual level of write activity.  The number of concurrent connections 
across the cluster was also much lower than usual during this time 
(~800); we usually have closer to 2k-3k concurrent connections.  The total 
number of initiated operations during the time frame was also within the 
normal range.  There was also nothing unusual about the amount of network 
traffic; it fit right in with normal traffic levels.
One thing that I did see is that there was an unusually high number of 
'deferring operation: binding' messages.  We normally average about 
400/day, but on this specific day we hit > 1500 such messages.
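For anyone wanting to check whether the deferrals line up with the CPU 
spike window, here is a minimal sketch that buckets 'deferring operation' 
messages by hour and by reason.  It assumes each log line starts with an 
ISO 8601 timestamp (as with rsyslog's RFC 5424-style file format); the 
sample lines, host name, and connection IDs are made up for illustration, 
so adjust the timestamp pattern to whatever your syslog actually writes.

```python
import re
import sys
from collections import Counter

# Assumption: lines begin with an ISO 8601 timestamp, e.g. 2023-01-01T08:12:03+00:00
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})T(\d{2}):")
# Captures the reason slapd gives after "deferring operation:" (e.g. "binding")
DEFER_RE = re.compile(r"deferring operation: (.+?)\s*$")

# Hypothetical sample lines, not taken from the deployment described above
SAMPLE = [
    "2023-01-01T08:12:03+00:00 ldap01 slapd[1234]: connection_input: conn=1001 deferring operation: binding",
    "2023-01-01T08:47:59+00:00 ldap01 slapd[1234]: connection_input: conn=1002 deferring operation: binding",
    "2023-01-01T09:01:10+00:00 ldap01 slapd[1234]: connection_input: conn=1003 deferring operation: binding",
    "2023-01-01T09:01:11+00:00 ldap01 slapd[1234]: conn=1003 op=2 SRCH base=\"dc=example,dc=com\"",
]


def count_deferrals(lines):
    """Return a Counter keyed by (date, hour, reason) for deferral messages."""
    counts = Counter()
    for line in lines:
        reason = DEFER_RE.search(line)
        ts = TS_RE.match(line)
        if reason and ts:
            counts[(ts.group(1), ts.group(2), reason.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Pass a log file path as the first argument, or run with no arguments
    # to see the output for the built-in sample lines.
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
            result = count_deferrals(fh)
    else:
        result = count_deferrals(SAMPLE)
    for (day, hour, reason), n in sorted(result.items()):
        print(f"{day} {hour}:00 UTC  {reason:<20} {n}")
```

Comparing the per-hour counts against the 8 AM to 10 AM UTC window would 
show whether the extra deferrals were concentrated in the spike or spread 
across the day.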
Thanks,
Quanah