OK, that's embarrassing. I forgot the last few lines of each slapd.conf. Please assume that each of the four files ends with the following lines, after all the syncrepl rids have been configured:

mirrormode TRUE
overlay syncprov
syncprov-checkpoint 50 10
syncprov-sessionlog 100


On Thu, Mar 31, 2011 at 9:06 PM, Mark <mah042@gmail.com> wrote:
I've been testing a 4-way multi-master setup using OpenLDAP 2.4.25, and I'm seeing sporadic problems that I'm having difficulty diagnosing.

I have four identical RHEL 4.9 machines on the same switch (NTP synchronized to the same stratum 2 servers):
  dual-core Xeon 5110 1.60GHz
  8GB RAM
  100Mb full-duplex NIC
  OpenLDAP 2.4.25, BDB 4.8.30, OpenSSL 1.0.0d, Cyrus SASL 2.1.23 (using no tls/ssl at this time)

I start the slapds with '-d conns,sync' and then begin the test: I ldapadd 1000 DNs to one of the servers. After all the syncing has stopped, I compare the contents of the four slapds against each other, looking for differences. Occasionally as many as a couple hundred DNs are missing from one or more of the instances. When that happens, I've noticed that the multi-master node with fewer DNs has lost its consumer connection to one of its multi-master providers (confirmed using lsof and netstat) and never attempts to reconnect, while the provider still shows the connection as open (also per lsof and netstat). When the consumer is in this state I can connect to its cn=config and cn=monitor backends (and browse them), but when I try to connect to its multi-master backend the connection attempt just hangs. It's almost as if the connect succeeds but the client is left waiting for a response from the server that never arrives. Also, the consumer slapd will not respond to a 'kill -TERM' in this state and must be stopped with 'kill -KILL'. The same thing sometimes occurs when I delete the entire tree.
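
For anyone who wants to repeat the comparison step described above, here is a minimal sketch of one way it could be scripted. It is not the exact procedure used in this test. It assumes each server's contents have already been dumped to an LDIF file (for example with slapcat or ldapsearch), and the function names extract_dns and missing_by_host are made up for illustration. DNs are compared after simple lowercasing, which is only an approximation of real DN normalization.

```python
import base64


def extract_dns(ldif_text):
    """Return the set of DNs found in an LDIF dump (lowercased)."""
    # Unfold LDIF continuation lines: a line that starts with a single
    # space continues the previous line.
    unfolded = []
    for line in ldif_text.splitlines():
        if line.startswith(" ") and unfolded:
            unfolded[-1] += line[1:]
        else:
            unfolded.append(line)

    dns = set()
    for line in unfolded:
        if line.startswith("dn:: "):
            # Base64-encoded DN (used when the DN is not plain ASCII).
            dn = base64.b64decode(line[5:]).decode("utf-8")
        elif line.startswith("dn: "):
            dn = line[4:]
        else:
            continue
        dns.add(dn.strip().lower())
    return dns


def missing_by_host(dumps):
    """Map each host to the DNs present on some other host but not on it.

    `dumps` maps a host label to the LDIF text dumped from that host.
    """
    dn_sets = {host: extract_dns(text) for host, text in dumps.items()}
    all_dns = set().union(*dn_sets.values())
    return {host: all_dns - dns for host, dns in dn_sets.items()}


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python compare_dns.py host1.ldif host2.ldif ...
    dumps = {}
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as fh:
            dumps[path] = fh.read()
    for host, missing in sorted(missing_by_host(dumps).items()):
        print(f"{host}: {len(missing)} missing")
        for dn in sorted(missing):
            print(f"  {dn}")
```

A host that reports a non-empty set here is one that stopped receiving updates before the others, which is the symptom described above.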

I've been trying to catch logging information that might help but so far nothing's jumping out at me. While I continue to try to reproduce and parse through logfiles maybe someone can look at my slapd.confs below and see if I might have configured something wrong (I'm listing the original slapd.conf files below, but I've used slaptest to convert them to slapd.d/cn=config.ldif format):

HOST1 slapd.conf:

include /tmp/openldap/multi-master/etc/schema/core.schema
include /tmp/openldap/multi-master/etc/schema/cosine.schema
include /tmp/openldap/multi-master/etc/schema/nis.schema
argsfile /tmp/openldap/multi-master/var/run/slapd.args
pidfile /tmp/openldap/multi-master/var/run/slapd.pid
threads 16
idletimeout 0
writetimeout 5
reverse-lookup off
timelimit time.soft=30 time.hard=300
sizelimit size.soft=500 size.hard=1000
password-hash {SSHA}
loglevel stats sync
serverid 001
modulepath /tmp/openldap/multi-master/libexec
moduleload back_monitor.la
moduleload back_hdb.la
moduleload syncprov.la

database config
rootdn cn=manager,cn=config
rootpw {SSHA}yMFj3Y7KPh223NkkKLQsFeLUVm08Ckpm

database monitor
rootdn cn=manager,cn=monitor
rootpw {SSHA}vPVSN8o8eRnLdC/bGS7yDwQGeH4BHc0R

database hdb
suffix dc=example,dc=com
rootdn cn=manager,dc=example,dc=com
rootpw {SSHA}0obbsJw5Yq2XAkdd/kS7vokaB9rrSOtI
directory /tmp/openldap/multi-master/var/data/dc=example,dc=com
cachesize 30000
cachefree 5
checkpoint 128 15
dncachesize 25000
idlcachesize 100000
index objectClass eq
index entryCSN eq
index entryUUID eq

syncrepl rid=001
  provider=ldap://host2:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off

syncrepl rid=002
  provider=ldap://host3:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off

syncrepl rid=003
  provider=ldap://host4:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off


HOST2 slapd.conf:

include /tmp/openldap/multi-master/etc/schema/core.schema
include /tmp/openldap/multi-master/etc/schema/cosine.schema
include /tmp/openldap/multi-master/etc/schema/nis.schema
argsfile /tmp/openldap/multi-master/var/run/slapd.args
pidfile /tmp/openldap/multi-master/var/run/slapd.pid
threads 16
idletimeout 0
writetimeout 5
reverse-lookup off
timelimit time.soft=30 time.hard=300
sizelimit size.soft=500 size.hard=1000
password-hash {SSHA}
loglevel stats sync
serverid 002
modulepath /tmp/openldap/multi-master/libexec
moduleload back_monitor.la
moduleload back_hdb.la
moduleload syncprov.la

database config
rootdn cn=manager,cn=config
rootpw {SSHA}yMFj3Y7KPh223NkkKLQsFeLUVm08Ckpm

database monitor
rootdn cn=manager,cn=monitor
rootpw {SSHA}vPVSN8o8eRnLdC/bGS7yDwQGeH4BHc0R

database hdb
suffix dc=example,dc=com
rootdn cn=manager,dc=example,dc=com
rootpw {SSHA}0obbsJw5Yq2XAkdd/kS7vokaB9rrSOtI
directory /tmp/openldap/multi-master/var/data/dc=example,dc=com
cachesize 30000
cachefree 5
checkpoint 128 15
dncachesize 25000
idlcachesize 100000
index objectClass eq
index entryCSN eq
index entryUUID eq

syncrepl rid=001
  provider=ldap://host1:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off

syncrepl rid=002
  provider=ldap://host3:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off

syncrepl rid=003
  provider=ldap://host4:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off


HOST3 slapd.conf:

include /tmp/openldap/multi-master/etc/schema/core.schema
include /tmp/openldap/multi-master/etc/schema/cosine.schema
include /tmp/openldap/multi-master/etc/schema/nis.schema
argsfile /tmp/openldap/multi-master/var/run/slapd.args
pidfile /tmp/openldap/multi-master/var/run/slapd.pid
threads 16
idletimeout 0
writetimeout 5
reverse-lookup off
timelimit time.soft=30 time.hard=300
sizelimit size.soft=500 size.hard=1000
password-hash {SSHA}
loglevel stats sync
serverid 003
modulepath /tmp/openldap/multi-master/libexec
moduleload back_monitor.la
moduleload back_hdb.la
moduleload syncprov.la

database config
rootdn cn=manager,cn=config
rootpw {SSHA}yMFj3Y7KPh223NkkKLQsFeLUVm08Ckpm

database monitor
rootdn cn=manager,cn=monitor
rootpw {SSHA}vPVSN8o8eRnLdC/bGS7yDwQGeH4BHc0R

database hdb
suffix dc=example,dc=com
rootdn cn=manager,dc=example,dc=com
rootpw {SSHA}0obbsJw5Yq2XAkdd/kS7vokaB9rrSOtI
directory /tmp/openldap/multi-master/var/data/dc=example,dc=com
cachesize 30000
cachefree 5
checkpoint 128 15
dncachesize 25000
idlcachesize 100000
index objectClass eq
index entryCSN eq
index entryUUID eq

syncrepl rid=001
  provider=ldap://host1:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off

syncrepl rid=002
  provider=ldap://host2:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off

syncrepl rid=003
  provider=ldap://host4:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off

HOST4 slapd.conf:

include /tmp/openldap/multi-master/etc/schema/core.schema
include /tmp/openldap/multi-master/etc/schema/cosine.schema
include /tmp/openldap/multi-master/etc/schema/nis.schema
argsfile /tmp/openldap/multi-master/var/run/slapd.args
pidfile /tmp/openldap/multi-master/var/run/slapd.pid
threads 16
idletimeout 0
writetimeout 5
reverse-lookup off
timelimit time.soft=30 time.hard=300
sizelimit size.soft=500 size.hard=1000
password-hash {SSHA}
loglevel stats sync
serverid 004
modulepath /tmp/openldap/multi-master/libexec
moduleload back_monitor.la
moduleload back_hdb.la
moduleload syncprov.la

database config
rootdn cn=manager,cn=config
rootpw {SSHA}yMFj3Y7KPh223NkkKLQsFeLUVm08Ckpm

database monitor
rootdn cn=manager,cn=monitor
rootpw {SSHA}vPVSN8o8eRnLdC/bGS7yDwQGeH4BHc0R

database hdb
suffix dc=example,dc=com
rootdn cn=manager,dc=example,dc=com
rootpw {SSHA}0obbsJw5Yq2XAkdd/kS7vokaB9rrSOtI
directory /tmp/openldap/multi-master/var/data/dc=example,dc=com
cachesize 30000
cachefree 5
checkpoint 128 15
dncachesize 25000
idlcachesize 100000
index objectClass eq
index entryCSN eq
index entryUUID eq

syncrepl rid=001
  provider=ldap://host1:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off

syncrepl rid=002
  provider=ldap://host2:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off

syncrepl rid=003
  provider=ldap://host3:1389
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off
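
The four files above differ only in the serverid value and in which three peers appear as syncrepl providers. As a cross-check against copy-and-paste mistakes, the varying parts can be generated from a single host list. The following is a rough sketch under that assumption; per_host_config and the STANZA template are illustrative names, not part of OpenLDAP, and the output is only the host-specific fragment rather than a complete slapd.conf.

```python
HOSTS = ["host1", "host2", "host3", "host4"]
PORT = 1389

# Template copied from the syncrepl stanzas in the listings above;
# only the rid and the provider host change between stanzas.
STANZA = """syncrepl rid={rid:03d}
  provider=ldap://{provider}:{port}
  type=refreshAndPersist
  interval=00:00:05:00
  retry="15 +"
  searchbase="dc=example,dc=com"
  binddn="cn=manager,dc=example,dc=com"
  credentials="example_pass"
  starttls=no
  schemachecking=off
"""


def per_host_config(host, hosts=HOSTS, port=PORT):
    """Return the serverid line and syncrepl stanzas specific to `host`."""
    # serverid follows the host's position in the list (host1 -> 001, ...).
    serverid = hosts.index(host) + 1
    # Every other host is a provider; rids are numbered from 001 in list order.
    peers = [h for h in hosts if h != host]
    stanzas = [
        STANZA.format(rid=rid, provider=peer, port=port)
        for rid, peer in enumerate(peers, start=1)
    ]
    return "serverid {:03d}\n\n".format(serverid) + "\n".join(stanzas)


if __name__ == "__main__":
    for name in HOSTS:
        print("# ---- {} ----".format(name))
        print(per_host_config(name))
```

Diffing this output against the corresponding lines of each real file is a quick way to confirm that no host lists itself as a provider and that no rid is repeated within a file.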


Thank you.