On Fri, 23 Jun 2023 at 12:04, Quanah Gibson-Mount quanah@fast-mail.org wrote:
> In our 2.6.4 deployment, we had a significant spike in CPU usage one day last week that lasted approximately 2 hours (8 AM UTC to 10 AM UTC). During this time, some clients started timing out when talking to the LDAP service, and search response times spiked as well, up to 9.5 seconds on searches that normally take < 3 seconds (they do have large result sets). This happened on all 6 of the read nodes that we have in our load-balanced pool, so whatever the issue was, it hit all of them at the same time. It did not happen to 2 specialized read nodes that only serve one specific service, so it was something about the traffic going to those 6 nodes. The number of ops/second during that time frame was actually lower than usual across the cluster, with a peak of 200 ops/second. We often have higher peaks than that without this type of CPU usage spike.
I've noticed similar behavior on large accesslog purges. High CPU, poor response times, sometimes slapd even becomes unresponsive. Do these systems have an accesslog that gets purged?
Many things can cause a CPU usage spike.
If you don't already know, then the first step is finding out what process is spiking.
When I ran an active website for several years, I developed a very efficient monitoring program that watched the system and recorded when a program was consuming more CPU or memory than was normal.
This monitoring program consumes almost no resources when the system is not overloaded, and very few resources even when it is overloaded. You can set command line options to specify what details you want to see of processes causing an overload, and what constitutes an overload worth reporting.
I just now uploaded this program to:
https://github.com/ThePythonicCow/batch_top
I left this program running all the time, in the background, on the webserver I managed, and referred to the output when some unexpected overload started causing a problem.
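For readers who want a feel for the approach, here is a minimal sketch of threshold-based reporting of busy processes. It is not the batch_top program and makes no claim about how batch_top works; the names `ProcSample`, `parse_ps_output`, and `overloaded`, and the 80% default threshold, are all hypothetical. It assumes input shaped like the output of `ps -eo pid,pcpu,comm`.

```python
from typing import List, NamedTuple


class ProcSample(NamedTuple):
    """One process row: PID, CPU percentage, command name (hypothetical type)."""
    pid: int
    cpu_percent: float
    command: str


def parse_ps_output(text: str) -> List[ProcSample]:
    """Parse lines of 'PID %CPU COMMAND', skipping the header and malformed lines."""
    samples = []
    for line in text.strip().splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # header row or a line we cannot interpret
        samples.append(ProcSample(int(parts[0]), float(parts[1]), parts[2].strip()))
    return samples


def overloaded(samples: List[ProcSample], cpu_threshold: float = 80.0) -> List[ProcSample]:
    """Return the samples at or above the threshold, busiest first."""
    hot = [s for s in samples if s.cpu_percent >= cpu_threshold]
    return sorted(hot, key=lambda s: s.cpu_percent, reverse=True)


if __name__ == "__main__":
    # Assumed sample input; a real run would capture the ps output instead.
    sample = """\
  PID %CPU COMMAND
  101 95.2 slapd
  202  1.0 sshd
"""
    for proc in overloaded(parse_ps_output(sample)):
        print(f"{proc.pid} {proc.cpu_percent:.1f}% {proc.command}")
```

On Linux, the `pcpu` column from `ps` is averaged over the lifetime of the process, so a real monitor would more likely compare two samples taken a few seconds apart.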
--On Tuesday, June 27, 2023 2:46 AM -0500 Paul Jackson pj@usa.net wrote:
> Many things can cause a CPU usage spike.
> If you don't already know, then the first step is finding out what process is spiking.
That I already know -- slapd, it's the only service on the system. ;)
> I just now uploaded this program to:
Seems handy!
Regards, Quanah
--On Monday, June 26, 2023 9:43 PM -0400 David Hawes dhawes@vt.edu wrote:
> I've noticed similar behavior on large accesslog purges. High CPU, poor response times, sometimes slapd even becomes unresponsive. Do these systems have an accesslog that gets purged?
I definitely have hit that issue before, but this deployment is using standard Syncrepl. For delta-syncrepl, I found frequent purge intervals (maximum of 4 hours between them) resolved the issue. It does take some tuning depending on the general volume of the changes recorded in the accesslog db.
--Quanah
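As an illustration of the purge-interval tuning mentioned above, here is a small sketch that checks whether the interval in an accesslog `logpurge` value stays within a 4-hour cap. It assumes the `[ddd+]hh:mm[:ss]` time-span format and the `<age> <interval>` ordering from slapo-accesslog(5). The function names and the 7-day age in the examples are hypothetical, not values from this deployment.

```python
import re

# Time span as [ddd+]hh:mm[:ss]; days and seconds are optional.
_SPAN = re.compile(r"(?:(\d+)\+)?(\d{1,2}):(\d{2})(?::(\d{2}))?")


def span_to_seconds(span: str) -> int:
    """Convert a [ddd+]hh:mm[:ss] time span to seconds."""
    m = _SPAN.fullmatch(span)
    if m is None:
        raise ValueError(f"not a [ddd+]hh:mm[:ss] time span: {span!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def purge_interval_ok(logpurge_value: str, max_interval: str = "04:00") -> bool:
    """True if the interval part of '<age> <interval>' does not exceed max_interval."""
    _age, interval = logpurge_value.split()
    return span_to_seconds(interval) <= span_to_seconds(max_interval)


if __name__ == "__main__":
    # Hypothetical settings: keep 7 days of log entries, purge every 4 hours vs. daily.
    for value in ("07+00:00 00+04:00", "07+00:00 01+00:00"):
        print(value, "->", "ok" if purge_interval_ok(value) else "interval too long")
```

The right interval still depends on the volume of changes recorded in the accesslog database, so the 4-hour cap here is a starting point for tuning, not a fixed rule.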
openldap-technical@openldap.org