Basic Master/Slave replication, take slave down, slave hangs - openldap-technical

7 Aug 2008


This may be a problem that's been fixed between 2.4.6 and 2.4.11.  I'm hoping
someone recognizes it and can confirm it's been fixed.  (We committed to
2.4.6 for the short-term future.)
We've got basic refreshAndPersist replication working fine.
Replication declarations from the master slapd.conf...
overlay syncprov
syncprov-checkpoint 100 10
syncprov-sessionlog 100
Replication declarations from the slave slapd.conf...
syncrepl rid=123
       provider=ldap://<masterIP>:389
       type=refreshAndPersist
       retry="120 +"
       searchbase="o=replDB"
       bindmethod=simple
       binddn="cn=replman,o=replDB"
       credentials=password
When we shut down the slave, the slave hangs.  Here's the debug log...
daemon: closing 12582954+
slapd shutdown: waiting for 1 threads to terminate+
=>do_syncrepl rid=123+
connection_get(12582955)+
connection_get(12582955): got connid=0+
daemon: removing 12582955r+
ldap_free_request (origid 2, msgid 2)+
ldap_free_connection 1 1+
ldap_send_unbind+
ber_flush2: 7 bytes to sd 12582955+
  0000:  30 05 02 01 03 42 00                               0....B.
+
ldap_write: want=7, written=7+
  0000:  30 05 02 01 03 42 00                               0....B.
+
ldap_free_connection: actually freed+
We think the problem is in the following loop in tpool.c.  Specifically,
ltp_open_count has a value of 1 at shutdown and never drops to 0, so the
shutdown thread stays stuck in the loop.
If we set ltp_open_count to 0, the server comes down properly.  At that
point we can restart it again.
        while (pool->ltp_open_count) {
                if (!pool->ltp_pause)
                        ldap_pvt_thread_cond_broadcast(&pool->ltp_cond);
                ldap_pvt_thread_cond_wait(&pool->ltp_cond, &pool->ltp_mutex);
        }
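To make clear what we think the loop is waiting for, here is a minimal
sketch of the same wait pattern using plain POSIX threads.  This is not the
OpenLDAP code: struct sketch_pool, sketch_worker and sketch_run are
hypothetical names invented for illustration, and the assumption that each
worker decrements the counter and broadcasts as it exits is our reading of
the pattern, not something we have confirmed in tpool.c.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical, simplified pool.  Field names echo tpool.c, but this is
 * not the OpenLDAP structure. */
struct sketch_pool {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int open_count;   /* worker threads that have not exited yet */
    int stop;         /* set to 1 when shutdown is requested */
};

static void *sketch_worker(void *arg)
{
    struct sketch_pool *pool = arg;

    pthread_mutex_lock(&pool->mutex);
    while (!pool->stop)                     /* idle until shutdown */
        pthread_cond_wait(&pool->cond, &pool->mutex);

    /* Assumed exit protocol: decrement the count, then wake the waiter.
     * If a worker never reaches this point, the shutdown loop below
     * never sees open_count reach 0 and waits forever. */
    assert(pool->open_count > 0);
    pool->open_count--;
    pthread_cond_broadcast(&pool->cond);
    pthread_mutex_unlock(&pool->mutex);
    return NULL;
}

/* Start up to 16 workers, request shutdown, wait for all of them to exit.
 * Returns the open count observed after the wait loop. */
int sketch_run(int nthreads)
{
    struct sketch_pool pool;
    pthread_t tids[16];
    int i, started = 0, remaining;

    if (nthreads > 16)
        nthreads = 16;
    pthread_mutex_init(&pool.mutex, NULL);
    pthread_cond_init(&pool.cond, NULL);
    pool.open_count = 0;
    pool.stop = 0;

    for (i = 0; i < nthreads; i++) {
        pthread_mutex_lock(&pool.mutex);
        pool.open_count++;                  /* count the thread before it runs */
        pthread_mutex_unlock(&pool.mutex);
        if (pthread_create(&tids[started], NULL, sketch_worker, &pool) == 0) {
            started++;
        } else {
            pthread_mutex_lock(&pool.mutex);
            pool.open_count--;              /* creation failed, undo the count */
            pthread_mutex_unlock(&pool.mutex);
        }
    }

    pthread_mutex_lock(&pool.mutex);
    pool.stop = 1;
    pthread_cond_broadcast(&pool.cond);     /* wake idle workers */
    while (pool.open_count)                 /* same shape as the tpool.c loop */
        pthread_cond_wait(&pool.cond, &pool.mutex);
    remaining = pool.open_count;
    pthread_mutex_unlock(&pool.mutex);

    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    pthread_cond_destroy(&pool.cond);
    pthread_mutex_destroy(&pool.mutex);
    return remaining;
}
```

In this sketch, sketch_run(4) returns 0 because every worker decrements
open_count on its way out.  Our guess is that in our case the syncrepl
thread is still counted but never performs the equivalent step, which would
leave the count at 1 and match the hang we see.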
Does this problem look familiar to anyone?
Once again, we apologize for being on a backlevel release... but we
committed to 2.4.6 for the first release, which is coming up shortly.
We'll be upgrading for future releases... and we're hoping this problem (if
it is a real problem) has been fixed.
Thanks in advance...