Pierangelo Masarati wrote:
No more failures of this kind; however, now I intermittently get replication failures:
The problem persists (only once in a while). It might still be connection-related, since the logs of server #3, the proxy that pushes replication to the consumer, are stuffed with tons of "connection_read(...): no connection!"
What kind of system are you running on? Linux / multiprocessor?
One of the problems with epoll() on Linux is that it wakes up for HANGUP events all the time (they can't be selected or deselected in the input options; they're delivered whether or not you choose to wait for them). This also means we can't shut the notifications off once we've acknowledged or acted on them, so you'll get lots of repeated wakeups for the same hangup event. The new connection_hangup() function processes these inline for normal connections, but for client connections it still falls into the connection_read thread handling, so that their normal cleanup handlers can be invoked. If your server is too busy, it will take a while for the submitted thread to execute, and then you'll get a lot of these spurious messages.
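To illustrate the repeated hangup wakeups described above, here is a minimal standalone sketch. It is not slapd code: count_hup_wakeups() is a hypothetical helper, and it uses a pipe in place of a socket. The read end is registered with an empty event mask, the write end is closed, and each epoll_wait() call still reports EPOLLHUP.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper, not part of slapd: returns how many of `rounds`
 * epoll_wait() calls report EPOLLHUP for a descriptor that was registered
 * with an EMPTY event mask. Returns -1 on setup failure. */
int count_hup_wakeups(int rounds)
{
    int fds[2];
    int ep, i, hits = 0;
    struct epoll_event ev = { .events = 0 };  /* ask for nothing at all */
    struct epoll_event out;

    if (pipe(fds) != 0)
        return -1;

    ep = epoll_create1(0);
    if (ep < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    ev.data.fd = fds[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev) != 0) {
        close(ep);
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Closing the only writer hangs up the read end. */
    close(fds[1]);

    for (i = 0; i < rounds; i++) {
        /* Level-triggered: the same hangup is reported on every call
         * for as long as the descriptor stays registered. */
        int n = epoll_wait(ep, &out, 1, 100);
        if (n == 1 && (out.events & EPOLLHUP))
            hits++;
    }

    close(ep);
    close(fds[0]);
    return hits;
}
```

Calling count_hup_wakeups(3) on Linux should report the hangup on all three rounds, which is the behaviour that produces the repeated wakeups until the descriptor is removed from the epoll set or closed.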
I've been experimenting with epoll's edge-triggered and oneshot modes, which would prevent multiple wakeups from occurring for the same event. Unfortunately, when I set either mode, it seems that the events can't be *re-enabled* when we want them, and so slapd hangs. Still looking at this.
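For reference, the epoll_ctl(2) man page documents that a descriptor registered with EPOLLONESHOT is disabled after it fires once and has to be re-armed with EPOLL_CTL_MOD. The sketch below only shows that documented behaviour in isolation; oneshot_rearm_demo() is a hypothetical helper using a pipe, and it says nothing about why re-enabling did not work inside slapd.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper, not part of slapd: stores the return values of
 * three epoll_wait() calls in results[]: after the first event, while
 * the oneshot entry is disabled, and after re-arming it.
 * Returns 0 on success, -1 on setup failure. */
int oneshot_rearm_demo(int results[3])
{
    int fds[2];
    int ep, rc = -1;
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT };
    struct epoll_event out;

    if (pipe(fds) != 0)
        return -1;

    ep = epoll_create1(0);
    if (ep < 0)
        goto done_pipe;

    ev.data.fd = fds[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev) != 0)
        goto done;

    /* Make the read end readable; the byte is never consumed. */
    if (write(fds[1], "x", 1) != 1)
        goto done;

    /* First wait: the event fires and the entry is then disabled. */
    results[0] = epoll_wait(ep, &out, 1, 100);

    /* Second wait: data is still pending, but nothing is reported. */
    results[1] = epoll_wait(ep, &out, 1, 100);

    /* Re-arm with EPOLL_CTL_MOD; the event mask must be given again. */
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fds[0];
    if (epoll_ctl(ep, EPOLL_CTL_MOD, fds[0], &ev) != 0)
        goto done;

    /* Third wait: the still-pending data is reported once more. */
    results[2] = epoll_wait(ep, &out, 1, 100);
    rc = 0;

done:
    close(ep);
done_pipe:
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

On Linux the three calls should return 1, 0 and 1 in that order. In a multithreaded server the re-arm call has to happen on every path that finishes handling the descriptor, otherwise the connection stays silent.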
But that's beside the point - you shouldn't be seeing any replication failures at all, regardless of connection close handling. What else are you seeing now?