On Sun, 30 Mar 2008, hyc(a)symas.com wrote:
rein(a)tollevik.no wrote:
> On Sat, 29 Mar 2008, ando(a)sys-net.it wrote:
>> rein(a)basefarm.no wrote:
>>> I was seeing random failures of the
test050-syncrepl-multimaster test. One of
>>> the failures was that it went into a tight loop traversing a circular
runqueue
>>> it had managed to create in slapd_rq.task_list. It seems as this was caused
by
>>> missing mutex locks around accesses to slapd_rq, which the patch uploaded to
>>>
ftp://ftp.openldap.org/incoming/slapd_rq_lock.patch fixes.
>>>
>>> Before I applied this patch the test failed after being run a few times, with
it
>>> it has now passed 100 times and is still counting.
>> locks in back-bdb/config.c should be pointless, as modifications to the
>> configuration should only occur while all threads are paused. The rest
>> makes sort of sense, but I'd leave it to Howard.
Ignoring the ITS#5403 changes, I don't see anything here that isn't
config-related, therefore it's all running single-threaded.
Now that the configuration can be changed dynamically (as this test does)
I find it a bit odd that the config stuff should always be running
single-threaded. But there is obviously much I don't know about the
internals of slapd.
Of the "relevant" changes in syncrepl.c, I note that three
out of the four
chunks of the patch are in code that is only run when using cn=config to
delete a syncrepl configuration, and test050 never performs that operation.
The remaining chunk only takes affect when adding syncrepl config, and again,
slapd is single-threaded for that.
I have been experimenting a bit more with the patch, and it appears to be
the locks added to add_syncrepl() that stopped the test from failing.
The test uses ordinary ldapadd to add the syncrepl configuration after
slapd has been started, which I would expect to indicate that it has
entered the multi-threaded state. But again, it can revert to
single-threaded when modifying the configuration for all I know.
I've also run test050 thru hundreds of iterations without any
issue, without
these patches. If there's a problem in test050, I don't believe it's in this
code.
I have also run it several hundred times on other systems, it is only on
one of the 4 systems I build slapd that it fails. That is, that test
failed on one of the other systems with the 2.4.8 release, also with an
infinite loop, but that dosn't say vary much in this context.
The system where the test fail is the only multi-cpu system I have run it
on, which I recon to be the reason why it only fails there. Without this
patch the test fails about 20% of the times it is run on that system, the
reason varies between slapd being killed due to abort or seg. fault
(unknown why), or infinite loop traversing the circular slapd_rq. With
the patch it has succeeded several thousand times in a row, which I
account for more than just a coincident.
So, as long as this patch doesn't create any deadlock I can't really see
the problem with it and will stick to it in our version. I find it a bit
easier to understand and verify code that requiers locking if the locks
are always used, then I don't need to know the thread states that are
allowed to call the functions as well to know whether a lock is needed or
not..
Rein