In our 2.6.4 deployment, we had a significant spike in CPU usage one day last week that lasted approximately 2 hours (8 AM UTC to 10 AM UTC). During this time, some clients started timing out when talking to the LDAP service, and search response times spiked as well, up to 9.5 seconds on searches that normally take < 3 seconds (they do have large result sets). This happened on all 6 of the read nodes in our load-balanced pool, so whatever the issue was, it hit all of them at the same time. It did not happen on the 2 specialized read nodes that only serve one specific service, so it was something about the traffic going to those 6 nodes. The number of ops/second during that time frame was actually lower than usual across the cluster, with a peak of 200 ops/second. We often have higher peaks than that without this kind of CPU spike.
I'm curious what, with modern slapd + LMDB, I should look for that could drive such a spike. I thought perhaps there was a significant number of write operations at the same time, but this was not the case; there was no unusual level of write activity. There was also a much lower than usual number of concurrent connections across the cluster during this time (~800); we usually have closer to 2k-3k concurrent connections. The total number of initiated operations during the time frame was also within the normal range. There was also nothing unusual about the amount of network traffic; it fit right in with normal traffic levels.
One thing that I did see is that there was an unusually high number of 'deferring operation: binding' messages. We normally average about 400/day, but on this specific day we hit > 1500 such messages.
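To see whether those messages cluster inside the 8 AM to 10 AM window or are spread across the day, one option is to bucket them by hour. The following is only a rough sketch: it assumes a classic syslog timestamp prefix ("Jun 23 08:15:02 host ..."), which may not match your logging setup, and the function name is made up for illustration. It matches only on the message text quoted above.

```python
import re
from collections import Counter

# Assumption: lines start with a syslog-style stamp such as "Jun 23 08:15:02 ".
# Adjust this pattern if your logs use ISO timestamps or another layout.
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}):\d{2}:\d{2}\s")

def deferred_binds_per_hour(lines):
    """Count 'deferring operation: binding' messages per 'Mon DD HH' bucket."""
    counts = Counter()
    for line in lines:
        if "deferring operation: binding" not in line:
            continue
        m = STAMP.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Feeding it an open log file, for example `deferred_binds_per_hour(open(path))` with `path` pointing at wherever slapd logs on your hosts, gives a counter that can be compared hour by hour against a normal day.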
Thanks, Quanah
-----Original Message-----
From: Quanah Gibson-Mount quanah@fast-mail.org

> we usually have closer to 2k-3k concurrent connections. The total number of initiated operations during the time frame was also within normal range. There was also nothing unusual about amount of network traffic, it fit right in with normal traffic levels.

Every one of these that I have seen, with OpenLDAP (or Sun DSEE or AD), has been due to client traffic.
Connection count is not always related to workload. You can have a couple of connections doing a lot of searches, right?
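One way to check for a few busy connections is to count distinct operations per connection in the slapd log. This is a minimal sketch, assuming stats-level logging is enabled and that operation lines carry the usual "conn=N op=M" fields; the function name is hypothetical. Distinct (conn, op) pairs are counted, so the several log lines belonging to one operation are not counted twice.

```python
import re
from collections import Counter

# Assumption: operation lines contain "conn=<id> op=<id>" as in stats-level logs.
CONN_OP = re.compile(r"\bconn=(\d+) op=(\d+)\b")

def ops_per_connection(lines):
    """Return a Counter mapping connection id -> number of distinct operations."""
    seen = set()
    for line in lines:
        m = CONN_OP.search(line)
        if m:
            seen.add((int(m.group(1)), int(m.group(2))))
    return Counter(conn for conn, _ in seen)
```

Calling `.most_common(10)` on the result for the two-hour window would show whether a small number of connections accounts for most of the work.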
Quanah, I thought you'd know better than me: The "deferring operation: binding" message is an indication of a client with a lot of outstanding async requests, right?
So I think your problem is probably not a lot of clients or a lot of connections, but one or a few clients sending a lot of work to your server. If it is not writes, and not a voluminous quantity of data (as evidenced by your network metrics seeming normal), maybe it is some complex filters?
* Christopher Paul [23/06/2023 18:14] :
> Quanah, I thought you'd know better than me: The "deferring operation: binding" message is an indication of a client with a lot of outstanding async requests, right?
I have this message on one of my directories and it is due to a client performing two binds without waiting for the response between the two calls.
Sadly, this results in the directory gradually becoming unresponsive to requests.
Emmanuel
--On Saturday, June 24, 2023 3:44 AM +0200 Emmanuel Seyman emmanuel@seyman.fr wrote:
> * Christopher Paul [23/06/2023 18:14] :
>> Quanah, I thought you'd know better than me: The "deferring operation: binding" message is an indication of a client with a lot of outstanding async requests, right?
> I have this message on one of my directories and it is due to a client performing two binds without waiting for the response between the two calls.
The message in this case is clearly a symptom rather than the problem; I was mostly noting that it shows how significantly the servers were affected during this time period. :)
Regards, Quanah
--On Friday, June 23, 2023 7:14 PM +0000 Christopher Paul chris.paul@rexconsulting.net wrote:
> So I think your problem is probably not a lot of clients or a lot of connection but one or a few sending a lot of work to your server in the form of, if not writes; if not a voluminous quantity (as evidenced by your network metrics seeming normal), maybe some complex filters?
That's what I'm trying to determine, but I'm not seeing anything that really stands out from normal traffic. Even one query that showed a long qtime during the interval was clearly slowed by something else, as it was a very basic query that exact-matched one entry using indexed attributes, and it normally returns nearly instantly.
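For pulling out every operation with a long qtime during the interval, rather than spotting them by eye, something like the following could help. It is a sketch under assumptions: it expects RESULT lines with "qtime=" immediately followed by "etime=" in seconds, which is what I would expect from 2.5+ stats logging but should be checked against your own logs, and the function name and default threshold are made up for illustration.

```python
import re

# Assumption: result lines look like
# "conn=7 op=3 SEARCH RESULT tag=101 err=0 qtime=2.500000 etime=2.600000 nentries=1 text="
RESULT = re.compile(
    r"\bconn=(\d+) op=(\d+) .*\bqtime=(\d+\.\d+) etime=(\d+\.\d+)"
)

def slow_ops(lines, min_qtime=1.0):
    """Return (qtime, etime, conn, op) tuples with qtime >= min_qtime, slowest first."""
    hits = []
    for line in lines:
        m = RESULT.search(line)
        if m and float(m.group(3)) >= min_qtime:
            hits.append(
                (float(m.group(3)), float(m.group(4)), int(m.group(1)), int(m.group(2)))
            )
    return sorted(hits, reverse=True)
```

If trivially indexed lookups show up here alongside the large searches, that would be consistent with operations waiting in the queue rather than being expensive themselves.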
--Quanah
openldap-technical@openldap.org