On Sun, 30 Mar 2008, hyc@symas.com wrote:
rein@tollevik.no wrote:
On Sat, 29 Mar 2008, ando@sys-net.it wrote:
rein@basefarm.no wrote:
I was seeing random failures of the test050-syncrepl-multimaster test. One of the failures was that it went into a tight loop traversing a circular runqueue it had managed to create in slapd_rq.task_list. It seems as this was caused by missing mutex locks around accesses to slapd_rq, which the patch uploaded to ftp://ftp.openldap.org/incoming/slapd_rq_lock.patch fixes.
Before I applied this patch the test failed after being run a few times, with it it has now passed 100 times and is still counting.
locks in back-bdb/config.c should be pointless, as modifications to the configuration should only occur while all threads are paused. The rest makes sort of sense, but I'd leave it to Howard.
Ignoring the ITS#5403 changes, I don't see anything here that isn't config-related, therefore it's all running single-threaded.
Now that the configuration can be changed dynamically (as this test does) I find it a bit odd that the config stuff should always be running single-threaded. But there is obviously much I don't know about the internals of slapd.
Of the "relevant" changes in syncrepl.c, I note that three out of the four chunks of the patch are in code that is only run when using cn=config to delete a syncrepl configuration, and test050 never performs that operation. The remaining chunk only takes affect when adding syncrepl config, and again, slapd is single-threaded for that.
I have been experimenting a bit more with the patch, and it appears to be the locks added to add_syncrepl() that stopped the test from failing. The test uses ordinary ldapadd to add the syncrepl configuration after slapd has been started, which I would expect to indicate that it has entered the multi-threaded state. But again, it can revert to single-threaded when modifying the configuration for all I know.
I've also run test050 thru hundreds of iterations without any issue, without these patches. If there's a problem in test050, I don't believe it's in this code.
I have also run it several hundred times on other systems, it is only on one of the 4 systems I build slapd that it fails. That is, that test failed on one of the other systems with the 2.4.8 release, also with an infinite loop, but that dosn't say vary much in this context.
The system where the test fail is the only multi-cpu system I have run it on, which I recon to be the reason why it only fails there. Without this patch the test fails about 20% of the times it is run on that system, the reason varies between slapd being killed due to abort or seg. fault (unknown why), or infinite loop traversing the circular slapd_rq. With the patch it has succeeded several thousand times in a row, which I account for more than just a coincident.
So, as long as this patch doesn't create any deadlock I can't really see the problem with it and will stick to it in our version. I find it a bit easier to understand and verify code that requiers locking if the locks are always used, then I don't need to know the thread states that are allowed to call the functions as well to know whether a lock is needed or not..
Rein