>> John Jasen <jjasen(a)gmail.com> wrote on 20.07.2018 at
15:41 in message
<2b16aef8-a174-2ae4-f113-45cc6a414a49(a)gmail.com>:
On 07/20/2018 04:41 AM, Ulrich.Windl(a)rz.uni-regensburg.de wrote:
> Hi!
>
> Stupid question: could it be your load-balancer that had a problem?
> What does the netstat output look like (sockets opened, queued data, etc.)?
I do not believe it to be the load-balancer. They log loss of contact
with the LDAP servers and drop them from the relay group shortly after
one of these events starts; and when it gets cleaned up, they're added
back in. I also do not suspect network between the load balancers and
the LDAP servers.
During such an event, ps -efT will usually show slapd running at full
thread capacity. Comparing that to threads in cn=monitor is not
possible, as those ldap searches fail.
I still don't know the internals of slapd, but could it be running out of worker
threads? Do you monitor the threads' activity leading up to the problem?
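
If nothing records this yet, a small script could log the thread count at intervals, so the minutes before an event can be compared with the configured "threads" value. The following is only a rough sketch: it assumes the usual Linux "ps -efT" column layout (UID PID SPID PPID C STIME TTY TIME CMD), and the function names are made up for illustration.

```python
import subprocess
from collections import Counter


def count_threads(ps_output, command="slapd"):
    """Count threads per PID in `ps -efT` output for one command name.

    Assumes the Linux procps column layout:
    UID PID SPID PPID C STIME TTY TIME CMD
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 8)  # keep CMD (with its arguments) whole
        if len(fields) < 9:
            continue
        pid, cmd = fields[1], fields[8]
        # Compare only the executable's base name, ignoring path and arguments.
        exe = cmd.split()[0].rsplit("/", 1)[-1]
        if exe == command:
            counts[int(pid)] += 1  # one output line per thread
    return dict(counts)


def sample(command="slapd"):
    """Run `ps -efT` once and return {pid: thread_count} for the command."""
    out = subprocess.run(
        ["ps", "-efT"], capture_output=True, text=True, check=True
    ).stdout
    return count_threads(out, command)
```

Calling sample() once a minute from cron and appending the result with a timestamp to a file would show whether the count climbs gradually or jumps right before the servers stop answering.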
The number of open sockets does not change substantially until after the event
subsides. The servers will show 1200-2000 open sockets before an event,
and drop lower when it clears up -- to quickly scale back up to
pre-event levels.
Another idea is to run "strace ... -p pid" on the hanging process to see what
it is doing, or maybe even to attach gdb to the process (most useful if the
binary still contains debug info).
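
To make that repeatable during an event, both steps could be wrapped in a small helper. This is a sketch under assumptions, not a tested procedure: the helper names and the output file name are invented, it needs root (or the slapd user), and gdb pauses the process while it collects the backtraces.

```python
import signal
import subprocess


def strace_cmd(pid, outfile):
    """strace attached to a running process: -f includes all threads,
    -tt adds timestamps, -o writes the trace to a file."""
    return ["strace", "-f", "-tt", "-o", outfile, "-p", str(pid)]


def gdb_backtrace_cmd(pid):
    """gdb in batch mode: attach, print a backtrace of every thread, detach."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture(pid, seconds=10, outfile="slapd.strace"):
    """Trace system calls for a few seconds, then return thread backtraces."""
    tracer = subprocess.Popen(strace_cmd(pid, outfile))
    try:
        tracer.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        # SIGINT lets strace detach cleanly instead of being killed.
        tracer.send_signal(signal.SIGINT)
        tracer.wait()
    result = subprocess.run(
        gdb_backtrace_cmd(pid), capture_output=True, text=True
    )
    return result.stdout
```

If most threads turn out to be sitting in the same call (a lock, a read, a write), that would already narrow things down quite a bit.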
The queues will show data being held until the socket(s) time out.
OK, so it does not look like a problem in the load-balancer.
Regards,
Ulrich