Rein Tollevik wrote:
To me it looks more as if the extended test050 has triggered race conditions that were already there, and that the syncprov half of ITS#5973 in particular has made them more likely to show up.
I have run the current test050 script with the 2.4.15 source (which didn't include these patches), and with RE24 (as of two days ago) without ITS#5973, and have seen the same type of failures there. Also, had the problems been triggered by the consumers receiving NEW_COOKIE messages, I would have expected to see "too old" messages on the consumers when they ignore entries. Instead, I find no trace of the missing entries ever being passed on from the provider. But where the update is lost I haven't found out yet. The problem seems to occur when the server where entries are missing receives its updates from one of the other consumers (i.e., not directly from server1). But whether it is syncrepl on this intermediate server that fails to pass it on to syncprov, or syncprov that loses them, I don't know.
Also, I now have around 30 core files similar to the one in ITS#5999, and I have also had a number of cases where I had to kill -9 a slapd running in a tight unlock, yield, lock loop at the same place in syncprov_op_mod(). These loops have all happened when slapd should be stopping, and the mt structure looks just as invalid as in the segfault cases. I have no idea whether this has anything to do with the test050 failures or not.
Btw, all of the test050 failures I have seen due to missing replications have taken place immediately after the initial loading of the consumers from server1. This could be a coincidence, but I have had enough of them to start wondering...
Yes, I've seen the same. My suspicion now is that it's due to an update arriving at the consumer around the time it transitions from refresh to persist mode, but I haven't been able to isolate it. I also note that adding a SLEEP1 near the beginning of test050, after the consumers have been started but before the ldapadd to populate the provider, completely eliminated the problem. So there's definitely an issue there that needs to be tracked down.
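For reference, the ordering that made the failures go away looks roughly like the sketch below. It is not the actual test050 text: start_consumers and populate_provider are hypothetical stand-ins that only record the order of the steps, and the SLEEP1 default of 1 second is an assumption made so the sketch runs on its own.

```shell
#!/bin/sh
# Sketch only; not copied from test050.
# Assumed default so this runs standalone; the real test takes SLEEP1
# from the test suite's settings.
SLEEP1=${SLEEP1-1}

STEPS=""
record() { STEPS="${STEPS:+$STEPS }$1"; }

# Hypothetical stand-ins for the real steps in the script.
start_consumers()   { record start_consumers; }
populate_provider() { record populate_provider; }

start_consumers
# The added pause: presumably it lets the consumers settle into persist
# mode before the first updates arrive (a suspicion, not a confirmed cause).
sleep "$SLEEP1"
record sleep
populate_provider

echo "$STEPS"
```

The pause only hides the problem by keeping updates away from the refresh-to-persist window, so it is a diagnostic hint rather than a fix.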
I've also seen the op_mod spin during shutdown. Unfortunately with the rest of the state already destroyed we can't identify what led to it. Seems we need to run the test a few times without restarting the servers to track that.