On 10/15/2013 01:10 PM, michael.vishchers@7p-group.com wrote:
It is not the client loop that is multithreaded but the LDAP server.
And it is not a misuse of the API but a problem that may be raised by day-to-day network problems.
I've boiled down the problem to a few simple configurations that work (or better, fail ;-) with both 2.4.23 and 2.4.36. A tgz file containing a setup with start script and testclient is attached. It should be sufficient to reproduce the fault.
The problem occurs only if we use session variable substitution in the rwm overlay, and only if a search is *immediately* (e.g. caused by network loss and client timeout) followed by an unbind.
I modified the reproducer a bit (the start script) and found out a few things. You can find the reproducer I'm using at [1].
Valgrind's helgrind shows some lock problems in the rwm overlay and also in back-ldap and connection.c. After correcting those the issue seems to be gone.
You can find helgrind logs at [2] (before the fix) and [3] (after).
Also, ElectricFence reveals some problems [4], which I haven't fixed yet.
A fix attempt can be found at [5]. I'm not sure whether it is a correct fix or whether it just masks the real issue, but I didn't manage to reproduce the problem after applying it.
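To illustrate the general kind of race described above (a search still reading per-session state while an unbind tears it down), here is a minimal sketch in Python. It is not OpenLDAP code and not the patch in [5]; the names `Session`, `search`, `unbind` and `run_once` are hypothetical and only model the pattern of guarding shared session data with a single lock.

```python
import threading


class Session:
    """Hypothetical stand-in for per-connection state (e.g. session variables)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.vars = {"binddn": "cn=test"}
        self.destroyed = False


def search(session, results):
    # The search handler reads session variables; it takes the lock first
    # and checks whether the session has already been torn down.
    with session.lock:
        if session.destroyed:
            results.append("session gone")
            return
        results.append(session.vars["binddn"])


def unbind(session):
    # The unbind handler destroys the session state under the same lock,
    # so it can never run in the middle of a search's read.
    with session.lock:
        session.vars = None
        session.destroyed = True


def run_once():
    # Start a search and an unbind back to back, as in the reproducer scenario.
    session = Session()
    results = []
    t_search = threading.Thread(target=search, args=(session, results))
    t_unbind = threading.Thread(target=unbind, args=(session,))
    t_search.start()
    t_unbind.start()
    t_search.join()
    t_unbind.join()
    return results[0]


# Either ordering is acceptable; what must not happen is a read of
# half-destroyed state.
outcomes = {run_once() for _ in range(200)}
```

With the lock in place, each run ends either with the search result or with a clean "session gone"; without it, the search could dereference state that the unbind has already cleared.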
[1] http://jsynacek.fedorapeople.org/openldap/its7723/reproducer/
[2] http://jsynacek.fedorapeople.org/openldap/its7723/results/slapd1-helgrind-br...
[3] http://jsynacek.fedorapeople.org/openldap/its7723/results/slapd1-helgrind-fi...
[4] http://jsynacek.fedorapeople.org/openldap/its7723/results/slapd1-broken-efen...
[5] http://jsynacek.fedorapeople.org/openldap/its7723/0001-fix-possible-race-con...
Cheers,