Hi,
Following up on this ITS I opened a while back.
With multi-master and normal syncrepl, I would sometimes receive:
slapd: null_callback : error code 0x10
slapd: syncrepl_entry: rid=106 be_modify failed (16)
This triggers a syncrepl connection drop/retry. It happened while the sessionlog was being replayed after starting a server with multiple providers.
I am now testing with 2.4.44 and have had a chance to look at this annoying, but seemingly non-destructive, issue in more detail.
As I partially described previously, this occurs within syncrepl_entry. For modifications, a diff of old_entry to new_entry is performed; then, if changes are needed, a be_modify is performed. There is, however, no locking that prevents two or more threads from performing these diffs, and then the mods, in an interleaved fashion within this function.
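To illustrate the interleaving, here is a toy model in C. It is not OpenLDAP code: toy_entry, toy_mod, toy_diff and toy_modify are hypothetical stand-ins for the entry, the modification list, the diff and be_modify, and the two "threads" are stepped by hand so the outcome is deterministic. It assumes both providers deliver the same change, and that a modify which deletes a value that is no longer present fails with 16 (the LDAP noSuchAttribute result code, 0x10). Whether the real failure has exactly this shape is an assumption on my part.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_SUCCESS            0
#define TOY_NO_SUCH_ATTRIBUTE 16   /* same number as LDAP noSuchAttribute (0x10) */

/* Hypothetical entry with one single-valued attribute. */
typedef struct { char value[32]; } toy_entry;

/* Hypothetical modification: delete the old value, add the new one. */
typedef struct { int needed; char del[32]; char add[32]; } toy_mod;

/* Stand-in for the old_entry/new_entry diff. */
toy_mod toy_diff(const toy_entry *cur, const char *incoming)
{
    toy_mod m;
    memset(&m, 0, sizeof m);
    if (strcmp(cur->value, incoming) != 0) {
        m.needed = 1;
        snprintf(m.del, sizeof m.del, "%s", cur->value);
        snprintf(m.add, sizeof m.add, "%s", incoming);
    }
    return m;
}

/* Stand-in for be_modify: fails if the value to delete is gone. */
int toy_modify(toy_entry *e, const toy_mod *m)
{
    if (!m->needed)
        return TOY_SUCCESS;
    if (strcmp(e->value, m->del) != 0)
        return TOY_NO_SUCH_ATTRIBUTE;
    snprintf(e->value, sizeof e->value, "%s", m->add);
    return TOY_SUCCESS;
}

/* Both "threads" diff against the same old state, then both modify. */
int toy_interleaved(void)
{
    toy_entry e = { "v1" };
    toy_mod a = toy_diff(&e, "v2");   /* thread A diffs */
    toy_mod b = toy_diff(&e, "v2");   /* thread B diffs the same old state */
    int ra = toy_modify(&e, &a);      /* A's mod applies cleanly */
    assert(ra == TOY_SUCCESS);
    return toy_modify(&e, &b);        /* B's mod is now stale */
}

/* Diff and modify are done as one unit per "thread". */
int toy_serialized(void)
{
    toy_entry e = { "v1" };
    toy_mod a = toy_diff(&e, "v2");
    int ra = toy_modify(&e, &a);
    toy_mod b = toy_diff(&e, "v2");   /* B sees A's result: nothing to do */
    int rb = toy_modify(&e, &b);
    return ra != TOY_SUCCESS ? ra : rb;
}
```

In this model toy_interleaved() returns 16 and toy_serialized() returns 0, which is the behaviour I would expect if diff and modify were held under one lock.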
In do_syncrep2, if the cookie tag is present, cs_pmutex is acquired and held for the duration of the modifications. This mutex protects against syncrepl_entry race conditions by serializing the modifications.
I have also noticed this issue during normal operations (i.e. all syncrepl in persist) when out-of-order writes are occurring on a master, which is relatively easy to reproduce on a server with an hdb backend.
When a cookie is not sent with an entry, cs_pmutex is not acquired. Without some protection, non-cookie modifications race each other between syncrepl threads.
So, I am testing a change that surrounds the syncrepl_entry "if" block (line 1036) with a cs_pmutex lock/release (when punlock < 0), to serialize non-cookie mods just like the cookie ones. So far, in my tests, I haven't seen the null_callback issue, either when catching up from the sessionlog or with ongoing out-of-order writes being replicated (running alongside an unmodified 2.4.44 to compare differences).
When acquiring cs_pmutex I have used the same logic as at line 958 (trylock with a shutdown check). I wonder whether it is safe to acquire the mutex with a standard ldap_pvt_thread_mutex_lock at this point, without spinning.
Line numbers are from RELENG_2_4 (721a038b7bc9732f52eeef5324c180c4f137cd75).
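For reference, the general shape of a trylock loop with a shutdown check looks roughly like the sketch below. This is not the code at line 958: it uses plain POSIX pthread calls as a stand-in for the ldap_pvt_thread_mutex_* wrappers, and toy_shutdown, lock_unless_shutdown and the two demo functions are hypothetical names for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the server's shutdown flag. */
atomic_int toy_shutdown = 0;

/*
 * Spin on trylock, giving up if shutdown is observed.
 * Returns 0 with the mutex held, -1 if shutdown was seen while waiting,
 * or a positive errno value for any trylock error other than EBUSY.
 */
int lock_unless_shutdown(pthread_mutex_t *m)
{
    assert(m != NULL);
    for (;;) {
        int rc = pthread_mutex_trylock(m);
        if (rc == 0)
            return 0;               /* acquired */
        if (rc != EBUSY)
            return rc;              /* real error, do not spin on it */
        if (atomic_load(&toy_shutdown))
            return -1;              /* stop waiting, caller should bail out */
        sched_yield();              /* let the current holder make progress */
    }
}

/* Uncontended case: the lock is acquired on the first attempt. */
int demo_uncontended(void)
{
    pthread_mutex_t m;
    pthread_mutex_init(&m, NULL);
    int rc = lock_unless_shutdown(&m);
    if (rc == 0)
        pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}

/* Mutex already held and shutdown set: the loop exits instead of blocking. */
int demo_shutdown_while_held(void)
{
    pthread_mutex_t m;
    pthread_mutex_init(&m, NULL);
    pthread_mutex_lock(&m);
    atomic_store(&toy_shutdown, 1);
    int rc = lock_unless_shutdown(&m);
    atomic_store(&toy_shutdown, 0);
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

The difference from a plain blocking lock is that a waiter can notice shutdown and return instead of sleeping until the holder releases the mutex, which is the property my question above is about.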
Thanks
Tom