Simone Piccardi piccardi@truelite.it wrote on 05.11.2020 at 16:17 in
message 5a6d778a-b75b-3027-3a88-f5507c83977b@truelite.it:
On 03/11/20 22:49, Quanah Gibson-Mount wrote:
The problem manifests itself without any periodicity, and looking at the number of connections before it we could not see any usage peak. We tried to strace the slapd threads during the problem, and they seem to be blocked on a mutex, waiting for the one thread that is running at 100% (on a single CPU, user time). I'm attaching the top output from one of these events.
If you can attach to the process while this is occurring, I'd suggest obtaining a full GDB backtrace to see what the different slapd threads are doing at that time. Also, what mutex specifically is slapd waiting on?
I ran gstack on the slapd PID during one of these events and saved the output. The files are attached, but the running slapd is stripped, so they are quite obscure (at least to me).
I think that even when the binary is stripped, you could "re-attach" the symbols (provided you saved them before stripping). For some distributions, such symbol (debug) packages are available for installation. I don't know about your package source, however.
We are trying to put a non-stripped version (compiled with CFLAGS='-g' and --enable-debug=yes) into use for a test, but that's a production machine, and it will take a while.
What should I do to find out which mutex it is? In the straces they are identified just by a number.
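On Linux, the number shown in a `futex(...)` call in strace output is the address of the futex word, which for a glibc pthread mutex is the address of the mutex itself. Counting how many waits hit each address at least shows whether all blocked threads are queued on the same lock. A rough sketch in Python follows; it assumes strace lines of the form `futex(0x..., FUTEX_WAIT_PRIVATE, ...` and the helper name `count_futex_waits` is invented here.

```python
import re
from collections import Counter

# Matches the address and operation of a futex call in strace output.
FUTEX_RE = re.compile(r"futex\((0x[0-9a-f]+), (FUTEX_[A-Z_]+)")


def count_futex_waits(strace_text):
    """Count FUTEX_WAIT* calls per futex address (wake calls are ignored)."""
    waits = Counter()
    for line in strace_text.splitlines():
        match = FUTEX_RE.search(line)
        if match and "WAIT" in match.group(2):
            waits[match.group(1)] += 1
    return waits


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        for address, count in count_futex_waits(handle.read()).most_common():
            print(f"{address}: {count} waits")
```

Once a build with debug information is running, the most contended address can be inspected from GDB, for instance with `info symbol <address>` (which only resolves statically allocated mutexes) or `print *(pthread_mutex_t *) <address>`, where the `__data.__owner` field should hold the thread id of the current holder. Whether that is enough to name the mutex depends on how it was allocated, so treat this as a starting point only.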
So a first question is: is there any other configuration parameter about indexing that I can try?
If you really believe that this is indexing related, you should be able to tell this from the slapd logs at "stats" logging, where you would see a specific search taking a significant amount of time. However, that generally does not lead to a system that's paused, as searches shouldn't trigger a mutex issue like what you're describing.
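To look for such a search in a "stats" log, one can pair each `SRCH base=...` line with the `SEARCH RESULT` line carrying the same `conn=` and `op=` values and compare the timestamps. The sketch below is an illustration under assumptions: it expects classic syslog lines ("Nov  5 00:39:04 host slapd[pid]: conn=... op=... SRCH ..."), it takes the year as a parameter because syslog timestamps omit it, and the function name `slow_searches` is hypothetical. Resolution is limited to one second.

```python
import re
from datetime import datetime

# Assumed syslog prefix followed by a slapd "stats" search line.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ slapd\[\d+\]: "
    r"conn=(\d+) op=(\d+) (SRCH|SEARCH RESULT)\b(.*)$"
)


def slow_searches(log_text, threshold_seconds, year=2020):
    """Return (elapsed, conn, op, details) for searches at or above the threshold."""
    started = {}
    slow = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        key = (int(match.group(2)), int(match.group(3)))
        details = match.group(5).strip()
        if match.group(4) == "SRCH":
            # Skip the follow-up "SRCH attr=..." line; keep only the request line.
            if details.startswith("base="):
                started[key] = (stamp, details)
        elif key in started:
            begin, request = started.pop(key)
            elapsed = (stamp - begin).total_seconds()
            if elapsed >= threshold_seconds:
                slow.append((elapsed, key[0], key[1], request))
    return sorted(slow, reverse=True)


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        for elapsed, conn, op, request in slow_searches(handle.read(), 10):
            print(f"{elapsed:6.0f}s conn={conn} op={op} {request}")
```

A search that started before the stall and only completed after it would show up here with an elapsed time close to the length of the stall. If nothing is reported, that is consistent with the stall not being caused by a single slow search.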
No, it is not that I believe that; as I said, it was just a guess about something that could need full CPU for tens of seconds, blocking all other operations. But from what you are saying, the guess is probably plain wrong.
Is this on RHEL7 or later? If you have both "stats" and "sync" logging enabled (the recommended setting for replicating nodes), what does the slapd log show is happening at this time?
The server is running an updated version of Amazon Linux (Amazon Linux AMI 2018.03).
We enabled stats and sync logging, and I'm attaching a redacted excerpt of the logs around the incident time, when I also took the gstack.txt (taken at 00:39:04) and gstack2.txt (taken at 00:39:20) backtraces. But during that time there is no data.
Simone