In our OpenLDAP 2.6.4 deployment, we had a significant spike in CPU usage one day last week that lasted approximately 2 hours (8 AM UTC to 10 AM UTC). During this time, some clients started timing out when talking to the LDAP service, and search response times spiked as well, up to 9.5 seconds on searches that normally take < 3 seconds (they do have large result sets). This happened on all 6 of the read nodes in our load-balanced pool, so whatever the issue was, it hit all of them at the same time. It did not happen on the 2 specialized read nodes that only serve one specific service, so it was something about the traffic going to those 6 nodes. The number of ops/second during that time frame was actually lower than usual across the cluster, with a peak of 200 ops/second. We often have higher peaks than that without this kind of CPU spike.
I'm curious what one should look for with modern slapd + LMDB that would drive such a spike. I thought perhaps there had been a significant number of write operations at the same time, but this was not the case; there was no unusual level of write activity. The number of concurrent connections across the cluster was also much lower than usual during this time (~800); we usually have closer to 2k-3k concurrent connections. The total number of initiated operations during the time frame was within normal range. There was also nothing unusual about the amount of network traffic; it fit right in with normal traffic levels.
One thing I did see was an unusually high number of 'deferring operation: binding' messages. We normally average about 400/day, but on this specific day we hit > 1500 such messages.
Thanks, Quanah