rein@basefarm.no wrote:
Hm, I hope I have found the race condition that causes this :-) I'm now running with the patch at the end to see if that solves it; only time will tell.
The race is that a new message may arrive between the time selecting on the syncrepl socket is enabled by the call to connection_client_enable() and the release of si_mutex. If so, the next call to do_syncrepl() may fail in its attempt to trylock the mutex, and no one will re-enable selecting on the socket again. My patch delays enabling of the socket until the mutex has been released, which looks safe to me. Or can the access to si->si_conn without a lock be a problem?
How about just moving the enable to after the runqueue manipulation is done?
Just need to be sure that do_syncrepl() isn't entered again before si->si_conn gets initialized.
It also occurs to me that we probably don't even need to manipulate the slapd runqueue in persist mode, when si->si_conn is already set. I.e., in that case we can only have gotten here because of a listener event, and not because of a runqueue schedule.
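To make the ordering issue concrete, here is a minimal single-threaded sketch of the race described above. It is not slapd code: struct sync_state, on_message() and simulate() are hypothetical stand-ins, select_enabled models what connection_client_enable() would turn on, and the "listener" is simply called inline at the point where a message would arrive. It assumes a default (non-recursive) mutex, so trylock on an already locked mutex fails.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the syncrepl state; not the slapd struct. */
struct sync_state {
    pthread_mutex_t mutex;   /* plays the role of si_mutex */
    bool select_enabled;     /* is the listener selecting on the socket? */
    int handled;             /* number of messages processed */
};

/* Models the listener firing for one incoming message.
 * Returns true if the message was processed. */
bool on_message(struct sync_state *s)
{
    if (!s->select_enabled)
        return false;               /* socket not watched yet */
    s->select_enabled = false;      /* listener stops selecting until re-enabled */
    if (pthread_mutex_trylock(&s->mutex) != 0)
        return false;               /* busy: give up, and nobody re-enables */
    s->handled++;
    s->select_enabled = true;       /* re-enable after successful processing */
    pthread_mutex_unlock(&s->mutex);
    return true;
}

/* Runs one task with either ordering and reports whether the socket
 * is still being watched afterwards. */
bool simulate(bool enable_after_unlock)
{
    struct sync_state s;
    bool watched;

    pthread_mutex_init(&s.mutex, NULL);
    s.select_enabled = false;
    s.handled = 0;

    pthread_mutex_lock(&s.mutex);
    if (!enable_after_unlock) {
        s.select_enabled = true;    /* enable while still holding the mutex */
        on_message(&s);             /* message arrives inside the window */
        pthread_mutex_unlock(&s.mutex);
    } else {
        pthread_mutex_unlock(&s.mutex);
        s.select_enabled = true;    /* enable only after the release */
        on_message(&s);             /* same message, now processed */
    }

    watched = s.select_enabled;
    pthread_mutex_destroy(&s.mutex);
    return watched;
}
```

With the enable done before the unlock, simulate(false) ends with the socket no longer watched, which is the stall; with the enable moved after the unlock, simulate(true) leaves it watched. The sketch says nothing about whether si->si_conn is safe to touch without the lock.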