Full_Name: Rein Tollevik
Version: CVS head
OS: solaris8
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (84.215.2.34)
Our persistent syncrepl consumers stop receiving data after a while, with no indication of why :-( They don't recognize restarts of the producer, so the only way to get replication running again is to restart the consumers.
The consumers have a single bdb backend database that is replicated from the producer, and they use the auditlog overlay on this backend. There are 4 of them, running in pairs as load-balanced search servers on two sites. The two servers in each pair are identically configured, and they are all running 64bit solaris8, if that matters.
There are two master servers, one on each site, with a more complicated configuration. They replicate subordinate backend databases between each other, and the consumers replicate the glue suffix of these backends. These servers are running linux in 32- and 64bit mode, and I have not seen the same type of replication stops between these master servers.
netstat shows that the send queue is full on the producer side and that the receive window is empty on the consumer, which to me looks as if the consumer has stopped reading from the producer.
I have used a debugger to look at the servers after they have stopped receiving. The syncrepl task is sitting at the end of slapd_rq.task_list with next_sched.tv_sec==0 (which means it will not be scheduled normally?). The syncrepl is configured with "retry=60 +". The si_conn of the syncinfo_t looks normal, and its c_sd socket is in the slap_daemon.sd_actives fd_set, but in neither sd_readers nor sd_writers. None of the threads are running any syncrepl code.
Looking at the "slapd -d sync" output and the auditlogs, it seems to stop receiving in the middle of a burst of updates (although that could be a coincidence). The other slave in the pair continues to receive updates, so I assume this is a consumer side problem.
So far it looks to me as if the syncrepl thread has managed to return without adding the connection socket back to slap_daemon.sd_readers. But whether that is correct or not, and how it has managed to do it if so, I cannot tell.
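To make the suspected failure mode concrete, here is a small stand-alone sketch of it. This is not slapd code: struct fd_state and all the function names are made up for illustration, and only the idea of an "actives" set and a "readers" set is taken from what the debugger showed. It assumes the usual select() pattern where a descriptor is taken out of the read set while a handler works on it and has to be put back afterwards.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical stand-in for the daemon's descriptor bookkeeping. */
struct fd_state {
    fd_set actives;  /* descriptors the daemon knows about */
    fd_set readers;  /* descriptors it currently selects on for reading */
};

static void fd_state_add(struct fd_state *st, int fd)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_SET(fd, &st->actives);
    FD_SET(fd, &st->readers);
}

/* A handler takes over the descriptor: stop selecting on it. */
static void fd_state_suspend_read(struct fd_state *st, int fd)
{
    FD_CLR(fd, &st->readers);
}

/* The handler is done: the descriptor must go back into the read set. */
static void fd_state_resume_read(struct fd_state *st, int fd)
{
    if (FD_ISSET(fd, &st->actives))
        FD_SET(fd, &st->readers);
}

/* One select() pass; returns 1 if fd would be dispatched for reading. */
static int fd_state_poll(const struct fd_state *st, int fd)
{
    fd_set rd = st->readers;               /* select() modifies its argument */
    struct timeval tv = { 0, 100000 };     /* 100 ms, so the demo terminates */
    int n = select(fd + 1, &rd, NULL, NULL, &tv);
    return n > 0 && FD_ISSET(fd, &rd);
}

/* Returns 1 if pending data is noticed, 0 if it is silently ignored,
 * -1 on setup failure. With resume == 0 the handler "forgets" to re-add
 * the descriptor. */
static int demo(int resume)
{
    int p[2], seen;
    struct fd_state st;

    FD_ZERO(&st.actives);
    FD_ZERO(&st.readers);
    if (pipe(p) != 0)
        return -1;

    fd_state_add(&st, p[0]);
    fd_state_suspend_read(&st, p[0]);
    if (resume)
        fd_state_resume_read(&st, p[0]);

    /* Data arrives from the peer. */
    if (write(p[1], "x", 1) != 1)
        seen = -1;
    else
        seen = fd_state_poll(&st, p[0]);

    close(p[0]);
    close(p[1]);
    return seen;
}
```

With resume set, the pending byte is noticed. Without it, the descriptor is still in "actives" but select() never reports it, so the data stays unread and the sender's queue eventually fills up, which would match what netstat and the debugger show here.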
Any pointers as to what can be wrong are highly appreciated. I have a core file from a server that has stopped receiving if there is anything I should look for there.
Rein Tollevik
Basefarm AS