On 05/11/20 16:26, Howard Chu wrote:
Traces from a stripped binary are useless.
Using a non-stripped binary in production led to some performance problems, so it took a while to plan a deployment long enough to collect data.
I'm attaching two gstack traces taken during two different events. What I see is one thread (the second one; here the slapd pid was 31267) in epoll_wait and all the others waiting on a futex, except one, 31280. But I am not skilled enough to understand what it is doing.
We also had a much longer event (it lasted some minutes) after the ones from which the attached traces were taken. During this one all the threads were straced and we don't have a gstack trace.
This time we also got some different errors in the log, and I'm attaching a redacted excerpt in the hope that it provides a useful clue. What we found were messages like:
Nov 10 00:49:56 ldp-11 slapd[31267]: nonpresent_callback: rid=012 present UUID 258f12bd-b531-426e-8dc8-49263545db58, dn cn=905719,cn=protected,o=ourorg
They started to appear around five seconds before most of the slapd threads stopped, waiting on a futex (that happened near 00:50:03). After that there were still a lot of messages in the log (but only "nonpresent_callback" ones) up to around 00:50:57; then nothing more until activity resumed (around 00:53:06).
From the strace we saw that the second thread (MAIN_PID+1, here 31268) was in epoll_wait the whole time, with some activity at the beginning, then just waking up every 2500 ms and doing nothing. Another thread was sending (with sendto) a lot of messages to fd 3 (judging from how they begin, they seem to be syslog messages) for about 50 seconds (and the log is full of "nonpresent_callback" during this time); then it also stopped on the same futex as the other threads.
The only thread (apart from the epoll waiter) that never stopped made just the following system calls:
00:50:05.384738 mprotect(0x7f23bb1bc000, 9613312, PROT_READ|PROT_WRITE) = 0
00:53:05.415500 msync(0x7f23e6e56000, 25769803776, MS_SYNC) = 0
00:53:06.114513 futex(0x7f29eae6e040, FUTEX_WAKE, 1) = 1
The FUTEX_WAKE was on the futex that was blocking all the other threads, and after it they resumed working.
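As a rough cross-check of the timings, here is a small Python sketch that computes the gaps between the three system calls quoted above and converts the msync length to GiB. It assumes the timestamps mark the moment each call was entered (the usual strace behaviour) and that all three fall on the same day; the helper names `parse` and `gaps` are only illustrative.

```python
from datetime import datetime

# The three strace lines quoted above, one call per line.
TRACE = """\
00:50:05.384738 mprotect(0x7f23bb1bc000, 9613312, PROT_READ|PROT_WRITE) = 0
00:53:05.415500 msync(0x7f23e6e56000, 25769803776, MS_SYNC) = 0
00:53:06.114513 futex(0x7f29eae6e040, FUTEX_WAKE, 1) = 1
"""


def parse(line):
    """Return (timestamp, syscall name) for one timestamped strace line."""
    stamp, rest = line.split(" ", 1)
    return datetime.strptime(stamp, "%H:%M:%S.%f"), rest.split("(", 1)[0]


def gaps(trace):
    """Seconds between the start of each call and the start of the next one."""
    events = [parse(line) for line in trace.splitlines() if line.strip()]
    return [
        (a[1], b[1], (b[0] - a[0]).total_seconds())
        for a, b in zip(events, events[1:])
    ]


# Length argument of the msync call, converted from bytes to GiB.
MSYNC_BYTES = 25769803776
MSYNC_GIB = MSYNC_BYTES / 2**30

for first, second, secs in gaps(TRACE):
    print(f"{first} -> {second}: {secs:.3f} s")
print(f"msync length: {MSYNC_GIB:.0f} GiB")
```

Under those assumptions, about 180 seconds pass between the mprotect and the msync, about 0.7 seconds between the msync and the FUTEX_WAKE, and the msync covers a 24 GiB range.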
I hope this is enough to pin down the problem.
Simone