John Jasen jjasen@gmail.com wrote on 20.07.2018 at 15:41 in message
2b16aef8-a174-2ae4-f113-45cc6a414a49@gmail.com:
On 07/20/2018 04:41 AM, Ulrich.Windl@rz.uni-regensburg.de wrote:
Hi!
Stupid question: could it be your load-balancer that had a problem? What does the netstat output look like (open sockets, queued data, etc.)?
I do not believe it to be the load-balancer. The load balancers log loss of contact with the LDAP servers and drop them from the relay group shortly after one of these events starts; when it gets cleaned up, the servers are added back in. I also do not suspect the network between the load balancers and the LDAP servers.
During such an event, ps -efT will usually show slapd running at full thread capacity. Comparing that to the thread counts in cn=monitor is not possible, as LDAP searches fail during the event.
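Since cn=monitor cannot be queried during an event, the thread count could be logged from the OS side instead. A minimal sketch, assuming a Linux host where each thread of a process appears as a directory under /proc/<pid>/task; the function name and the `proc_root` parameter are made up for illustration and are not part of any slapd tooling:

```python
import os


def count_threads(pid, proc_root="/proc"):
    """Count the threads of a process by listing <proc_root>/<pid>/task.

    On Linux every thread has a numeric directory there; this is the
    same information ps -efT prints one row per thread for.
    """
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sum(1 for name in os.listdir(task_dir) if name.isdigit())
```

Calling this for the slapd PID every few seconds and logging the result with a timestamp would give a record of thread usage leading up to an event, which can then be compared with the configured thread limit.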
I still don't know the internals of slapd, but could it be running out of worker threads? Do you monitor the threads' activity leading up to the problem?
The number of open sockets does not change substantially until after the event subsides. The servers show 1200-2000 open sockets before an event and fewer once it clears up, then quickly scale back up to pre-event levels.
The other idea is to try to run "strace ... -p pid" on the hanging process to see what it is doing, or maybe even try to attach gdb to the process (most useful if the binary still contains debug info).
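As a rough sketch of that suggestion, the two invocations could be assembled as below. The helper names are hypothetical; the flags used (strace -f, -tt, -p and gdb -p, -batch, -ex) are standard ones, but whether they produce useful output depends on the installed versions and on debug symbols being available:

```python
def strace_command(pid):
    """Build an strace command line that attaches to a running process.

    -f  also follow the threads of the process (slapd is multi-threaded)
    -tt prefix each syscall with a wall-clock timestamp
    """
    return ["strace", "-f", "-tt", "-p", str(pid)]


def gdb_backtrace_command(pid):
    """Build a gdb command line that prints a backtrace of every thread
    in the process and then exits (batch mode detaches again at the end)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


if __name__ == "__main__":
    # 12345 is a placeholder PID; substitute the PID of the hanging slapd.
    print(" ".join(strace_command(12345)))
    print(" ".join(gdb_backtrace_command(12345)))
```

Both commands normally need to run as root or as the user owning the process, and the process is paused briefly while gdb collects the backtraces.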
The queues will show data being held until the socket(s) time out.
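To track those queues over time rather than eyeballing netstat, the per-socket queue sizes can be read from /proc/net/tcp on Linux. A minimal sketch, assuming slapd listens on the standard LDAP port 389 (the thread does not state the port) and IPv4 only; the function name is made up for illustration:

```python
def queued_bytes(lines, local_port=389):
    """Parse lines in /proc/net/tcp format and return a list of
    (remote_address_hex, state_hex, tx_queue, rx_queue) tuples for
    sockets bound to local_port. Addresses are left in the kernel's
    raw hex form."""
    results = []
    for line in lines:
        fields = line.split()
        # Data rows start with a slot number such as "0:"; the header does not.
        if len(fields) < 5 or not fields[0].endswith(":"):
            continue
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if port != local_port:
            continue
        tx_hex, rx_hex = fields[4].split(":")
        results.append((fields[2], fields[3], int(tx_hex, 16), int(rx_hex, 16)))
    return results


if __name__ == "__main__":
    with open("/proc/net/tcp") as handle:
        for remote, state, tx, rx in queued_bytes(handle):
            if tx or rx:
                print(remote, state, "tx_queue=%d" % tx, "rx_queue=%d" % rx)
```

These are the same values netstat reports in its Send-Q and Recv-Q columns; sockets whose receive queue keeps growing during an event would suggest that slapd has stopped reading from them.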
OK, so it does not look like a problem in the load-balancer.
Regards, Ulrich