On Thu, Sep 11, 2008 at 2:33 AM, Quanah Gibson-Mount quanah@zimbra.com wrote:
Please test RE24.
Thanks!
--Quanah
on SUSE 10.2 SMP / Xen kernel, i.e.:
Linux suse 2.6.22.18-0.2-xen #1 SMP 2008-06-09 13:53:20 +0200 x86_64 x86_64 x86_64 GNU/Linux
test failed - producer and consumer2 databases differ
./scripts/test050-syncrepl-multimaster failed (exit 1)
make[2]: *** [bdb-yes] Error 1
make[2]: Leaving directory `/home/maxfieb/keep-ldap/openldap-src-RE24/tests'
make[1]: *** [test] Error 2
make[1]: Leaving directory `/home/maxfieb/keep-ldap/openldap-src-RE24/tests'
make: *** [test] Error 2
compiling against bdb 4.7.25.NC (no patches applied)
bdb options :
../dist/configure --disable-java --enable-posixmutexes --enable-largefile --enable-cxx --disable-compat185
make
make install
openldap configure :
CFLAGS="-g -g3 -ggdb -ggdb3" CPPFLAGS="-I/usr/local/BerkeleyDB.4.7/include" \
LDFLAGS="-L/usr/local/BerkeleyDB.4.7/lib -R/usr/local/BerkeleyDB.4.7/lib" \
./configure --enable-slapd --enable-backends --enable-shared \
--enable-overlays --disable-ipv6 --disable-sql --disable-ndb --with-threads \
--enable-ssl --enable-collect --enable-debug
Building HEAD with the same OpenLDAP configure options and the same bdb binaries as above, all tests succeed (albeit much more slowly; perhaps DB_CONFIG needs tweaking in HEAD?).
Cheers Brett
Brett @Google wrote:
On Thu, Sep 11, 2008 at 2:33 AM, Quanah Gibson-Mount quanah@zimbra.com wrote:
Please test RE24.
Thanks!
--Quanah
on SUSE 10.2 SMP / Xen kernel, i.e.:
Linux suse 2.6.22.18-0.2-xen #1 SMP 2008-06-09 13:53:20 +0200 x86_64 x86_64 x86_64 GNU/Linux
test failed - producer and consumer2 databases differ
./scripts/test050-syncrepl-multimaster failed (exit 1)
make[2]: *** [bdb-yes] Error 1
make[2]: Leaving directory `/home/maxfieb/keep-ldap/openldap-src-RE24/tests'
make[1]: *** [test] Error 2
make[1]: Leaving directory `/home/maxfieb/keep-ldap/openldap-src-RE24/tests'
make: *** [test] Error 2
compiling against bdb 4.7.25.NC (no patches applied)
bdb options :
../dist/configure --disable-java --enable-posixmutexes --enable-largefile --enable-cxx --disable-compat185
BerkeleyDB will use excessive CPU on Linux unless you configure --with-mutex=POSIX/pthreads. By default it uses "hybrid" mutexes, which use an assembly language spinlock loop before falling back to pthread mutex. The spinlock will chew up all of your CPU on multiprocessor machines...
make
make install
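Howard's advice about hybrid mutexes can be applied at BDB configure time. A minimal sketch of the rebuild; the --with-mutex value is from Howard's note, and every other flag is simply carried over from the build quoted above:

```shell
# Rebuild Berkeley DB 4.7 with pure POSIX/pthread mutexes instead of the
# default "hybrid" spinlock mutexes, as suggested above.  All flags other
# than --with-mutex are carried over from the build quoted in this thread.
../dist/configure --with-mutex=POSIX/pthreads \
    --disable-java --enable-posixmutexes --enable-largefile \
    --enable-cxx --disable-compat185
make
make install
```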
openldap configure :
CFLAGS="-g -g3 -ggdb -ggdb3" CPPFLAGS="-I/usr/local/BerkeleyDB.4.7/include" \
LDFLAGS="-L/usr/local/BerkeleyDB.4.7/lib -R/usr/local/BerkeleyDB.4.7/lib" \
./configure --enable-slapd --enable-backends --enable-shared \
--enable-overlays --disable-ipv6 --disable-sql --disable-ndb --with-threads \
--enable-ssl --enable-collect --enable-debug
Building HEAD with the same OpenLDAP configure options and the same bdb binaries as above, all tests succeed (albeit much more slowly; perhaps DB_CONFIG needs tweaking in HEAD?).
There's not much difference between HEAD and RE24 really. And I've seen test050 fail on both, at random times. Still haven't tracked it down.
Howard Chu wrote:
There's not much difference between HEAD and RE24 really. And I've seen test050 fail on both, at random times. Still haven't tracked it down.
Hm. I just saw test050 lock up on me, a repeat of the discussion in ITS#5454. Seems like we need to go back to a trylock here.
Howard Chu skrev:
Howard Chu wrote:
There's not much difference between HEAD and RE24 really. And I've seen test050 fail on both, at random times. Still haven't tracked it down.
Hm. I just saw test050 lock up on me, a repeat of the discussion in ITS#5454. Seems like we need to go back to a trylock here.
This reopens the race condition window that the ITS#5454 fix closed. If a trylock is required, then we must find some other way of ensuring that the socket is either not enabled while the lock is held, or that any messages received while it is locked are acted upon.
I have run test050 with this patch, and see the same failures as in http://www.openldap.org/lists/openldap-devel/200809/msg00051.html. I thought the errors were due to some recent changes in the code, but it appears to be the populating of consumer2 added in revision 1.11 of test050 itself that triggers it. I.e., the actual bug may have been there for some time :-(
Rein
Rein Tollevik wrote:
Howard Chu skrev:
Howard Chu wrote:
There's not much difference between HEAD and RE24 really. And I've seen test050 fail on both, at random times. Still haven't tracked it down.
Hm. I just saw test050 lock up on me, a repeat of the discussion in ITS#5454. Seems like we need to go back to a trylock here.
This reopens the race condition window that the ITS#5454 fix closed. If a trylock is required, then we must find some other way of ensuring that the socket is either not enabled while the lock is held, or that any messages received while it is locked are acted upon.
I have run test050 with this patch, and see the same failures as in http://www.openldap.org/lists/openldap-devel/200809/msg00051.html. I thought the errors were due to some recent changes in the code, but it appears to be the populating of consumer2 added in revision 1.11 of test050 itself that triggers it. I.e., the actual bug may have been there for some time :-(
AFAIR, writing to consumer2 was added precisely to trigger an issue that test050 was hiding in its initial form.
p.
Ing. Pierangelo Masarati
OpenLDAP Core Team
SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
-----------------------------------
Office:  +39 02 23998309
Mobile:  +39 333 4963172
Fax:     +39 0382 476497
Email:   ando@sys-net.it
-----------------------------------
Pierangelo Masarati skrev:
Rein Tollevik wrote:
Howard Chu skrev:
Howard Chu wrote:
There's not much difference between HEAD and RE24 really. And I've seen test050 fail on both, at random times. Still haven't tracked it down.
Hm. I just saw test050 lock up on me, a repeat of the discussion in ITS#5454. Seems like we need to go back to a trylock here.
This reopens the race condition window that the ITS#5454 fix closed. If a trylock is required, then we must find some other way of ensuring that the socket is either not enabled while the lock is held, or that any messages received while it is locked are acted upon.
I have run test050 with this patch, and see the same failures as in http://www.openldap.org/lists/openldap-devel/200809/msg00051.html. I thought the errors were due to some recent changes in the code, but it appears to be the populating of consumer2 added in revision 1.11 of test050 itself that triggers it. I.e., the actual bug may have been there for some time :-(
AFAIR, writing to consumer2 was added precisely to trigger an issue that test050 was hiding in its initial form.
Do you remember what issue? After applying the patch from ITS#5719, this test has switched from failing almost half of the times I run it to now managing several hundred iterations before deadlocking (i.e., not the same symptoms as I used to see). I have had two cases where it hung, and they seem to be related to the pool pausing that takes place when the updates to the config are replicated to the second consumer.
The test hangs in ldapadd while populating consumer2, where at least one thread (on the second consumer) replicating the config is hanging in handle_pause(). Note, this is without the lock -> trylock change in syncrepl.c, as I'm convinced that would reintroduce the syncrepl hangs we have seen earlier. None of the threads seem to be waiting on this lock, but I haven't had time to investigate it enough to conclude anything yet.
Rein
Rein Tollevik wrote:
Pierangelo Masarati skrev:
Rein Tollevik wrote:
Howard Chu skrev:
Howard Chu wrote:
There's not much difference between HEAD and RE24 really. And I've seen test050 fail on both, at random times. Still haven't tracked it down.
Hm. I just saw test050 lock up on me, a repeat of the discussion in ITS#5454. Seems like we need to go back to a trylock here.
This reopens the race condition window that the ITS#5454 fix closed. If a trylock is required, then we must find some other way of ensuring that the socket is either not enabled while the lock is held, or that any messages received while it is locked are acted upon.
I have run test050 with this patch, and see the same failures as in http://www.openldap.org/lists/openldap-devel/200809/msg00051.html. I thought the errors were due to some recent changes in the code, but it appears to be the populating of consumer2 added in revision 1.11 of test050 itself that triggers it. I.e., the actual bug may have been there for some time :-(
AFAIR, writing to consumer2 was added precisely to trigger an issue that test050 was hiding in its initial form.
Do you remember what issue? After applying the patch from ITS#5719, this test has switched from failing almost half of the times I run it to now managing several hundred iterations before deadlocking (i.e., not the same symptoms as I used to see). I have had two cases where it hung, and they seem to be related to the pool pausing that takes place when the updates to the config are replicated to the second consumer.
The test hangs in ldapadd while populating consumer2, where at least one thread (on the second consumer) replicating the config is hanging in handle_pause(). Note, this is without the lock -> trylock change in syncrepl.c, as I'm convinced that would reintroduce the syncrepl hangs we have seen earlier. None of the threads seem to be waiting on this lock, but I haven't had time to investigate it enough to conclude anything yet.
I felt the need to add a write to consumer2 after someone complained that writes to it were not replicated. That modification to the test confirmed that writes to consumer2 were not replicated consistently (at least, most of the time, AFAIR). In fact, prior to that modification only 2-way MMR was tested, and N-way (at least 3-way) did not work. Then the fix we're discussing resulted in writes to consumer2 being consistently replicated. I confirm that test050 is succeeding here for re24 (single-core i386).
p.
Pierangelo Masarati wrote:
Rein Tollevik wrote:
Pierangelo Masarati wrote:
Rein Tollevik wrote:
Howard Chu wrote:
Howard Chu wrote:
There's not much difference between HEAD and RE24 really. And I've seen test050 fail on both, at random times. Still haven't tracked it down.
Hm. I just saw test050 lock up on me, a repeat of the discussion in ITS#5454. Seems like we need to go back to a trylock here.
This reopens the race condition window that the ITS#5454 fix closed. If a trylock is required, then we must find some other way of ensuring that the socket is either not enabled while the lock is held, or that any messages received while it is locked are acted upon.
I have run test050 with this patch, and see the same failures as in http://www.openldap.org/lists/openldap-devel/200809/msg00051.html. I thought the errors were due to some recent changes in the code, but it appears to be the populating of consumer2 added in revision 1.11 of test050 itself that triggers it. I.e., the actual bug may have been there for some time :-(
AFAIR, writing to consumer2 was added precisely to trigger an issue that test050 was hiding in its initial form.
Do you remember what issue? After applying the patch from ITS#5719, this test has switched from failing almost half of the times I run it to now managing several hundred iterations before deadlocking (i.e., not the same symptoms as I used to see). I have had two cases where it hung, and they seem to be related to the pool pausing that takes place when the updates to the config are replicated to the second consumer.
The test hangs in ldapadd while populating consumer2, where at least one thread (on the second consumer) replicating the config is hanging in handle_pause(). Note, this is without the lock -> trylock change in syncrepl.c, as I'm convinced that would reintroduce the syncrepl hangs we have seen earlier. None of the threads seem to be waiting on this lock, but I haven't had time to investigate it enough to conclude anything yet.
I felt the need to add a write to consumer2 after someone complained that writes to it were not replicated. That modification to the test confirmed that writes to consumer2 were not replicated consistently (at least, most of the time, AFAIR). In fact, prior to that modification only 2-way MMR was tested, and N-way (at least 3-way) did not work. Then the fix we're discussing resulted in writes to consumer2 being consistently replicated. I confirm that test050 is succeeding here for re24 (single-core i386).
I hope I have finally found out what causes these deadlocks; it appears that it was indeed the lock in do_syncrepl, and that the switch back to using trylock may have cured the symptoms. The real problem, though, was that do_syncrepl was called twice. It appears that the call to slapd_set_read() done by connection_client_setup() is not needed and creates a race condition that could cause this.
The race appears when do_syncrepl creates the initial connection to its provider and thereby calls connection_client_setup(). When that in turn calls slapd_add_internal(), the socket gets added to the descriptors the daemon loop monitors (using epoll in my case). The daemon thread immediately returns from epoll_wait(), as input is already available on the syncrepl socket. It disables further events on that socket in connection_read_activate() before submitting the socket to a thread for processing.
Now, back in the syncrepl thread, connection_client_setup() continues by calling slapd_set_read(), thereby re-enabling events on the socket the daemon thread has just disabled. This causes the daemon thread to return from epoll_wait() again and fire off a second thread to handle the syncrepl input (which the first thread it submitted hasn't read yet).
During ordinary syncrepl processing this simply causes the second thread to wait for the first to finish, probably finding nothing left to read on the socket once it acquires the syncrepl lock. But if the first thread receives config updates (as in test050), it goes to sleep waiting for all other threads to pause, which the second thread never does, as it is blocked waiting for the syncrepl lock the first one holds.
I'm now running test050 without the slapd_set_read() call done in connection_client_setup(). So far it has been running for around three hours, passing about 250 times. But it must succeed for at least a day more before I'll conclude that this is really the fix.
While looking at this I came across another problem introduced by the ITS#5454 fix. When the replication is paused after a config update is received, it is rescheduled for immediate processing (burning CPU on a multi-core system unless the ITS#5719 fix is included). The next time it gets called from the runqueue, the connection is open, so it simply calls connection_client_enable() before skipping all runqueue manipulation. I.e., the rtask will stay on the running list forever. Maybe the best solution to this is to revert all the ITS#5454 fixes, and re-fix it by simply switching from trylock to a real lock at the start of do_syncrepl()?
Rein
Rein Tollevik wrote:
I'm now running test050 without the slapd_set_read() call done in connection_client_setup(). So far it has been running for around three hours, passing about 250 times. But it must succeed for at least a day more before I'll conclude that this is really the fix.
1700 passes now, and counting :-) Does anyone have any comments, or should I go ahead with filing ITSes and committing?
Rein
Rein Tollevik wrote:
Rein Tollevik wrote:
I'm now running test050 without the slapd_set_read() call done in connection_client_setup(). So far it has been running for around three hours, passing about 250 times. But it must succeed for at least a day more before I'll conclude that this is really the fix.
1700 passes now, and counting :-) Does anyone have any comments, or should I go ahead with filing ITSes and committing?
Rein
Go ahead and commit. We can tweak the patch after if needed.