trouble setting initial replication between multiple masters - openldap-software

13 Jan 2010

      Hello list.
I'm trying to achieve multi-master setup, starting from a working 
single-master setup. I took the master node configuration, added the 
following directives, and distributed it identically on two nodes:
# global
serverID 1  ldap://10.202.11.8:389/
serverID 2  ldap://10.202.11.9:389/
# db
...
syncrepl rid=1
     provider=ldap://10.202.11.8:389/
     starttls=yes
     tls_reqcert=never
     type=refreshAndPersist
     retry="60 +"
     logbase="cn=log"
     logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
     syncdata=accesslog
     searchbase="dc=msr-inria,dc=inria,dc=fr"
     scope=sub
     schemachecking=off
     bindmethod=simple
     binddn="cn=syncrepl,ou=roles,dc=msr-inria,dc=inria,dc=fr"
     credentials=XYZ
syncrepl rid=2
     provider=ldap://10.202.11.9:389/
     starttls=yes
     tls_reqcert=never
     type=refreshAndPersist
     retry="60 +"
     logbase="cn=log"
     logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
     syncdata=accesslog
     searchbase="dc=msr-inria,dc=inria,dc=fr"
     scope=sub
     schemachecking=off
     bindmethod=simple
     binddn="cn=syncrepl,ou=roles,dc=msr-inria,dc=inria,dc=fr"
     credentials=XYZ
mirrormode on
The 'tls_reqcert=never' is needed because those two servers are accessed 
from a virtual interface under a load-balancing server, and the 
certificate name matches the name of this virtual interface, not the 
actual interface of the servers (I wonder if openldap would support 
altSubjName in x509 certs, but that's another issue).
Then I imported my base in the first server, and launched both of them.
When node1 (full) tries to access node2 (empty), it fails, because it 
can't authenticate with a DN still not present in other node database, 
which is quite understandable.
However, node2 connects successfully, sync the the OU object in the DIT, 
then fails to actually sync the first user object, with this error 
message in his logs:
Jan 13 11:29:20 avron2 slapd[20939]: null_callback : error code 0x13
Jan 13 11:29:20 avron2 slapd[20939]: syncrepl_entry: rid=001 be_add 
uid=ingleber,ou=users,dc=msr-inria,dc=inria,dc=fr (19)
Jan 13 11:29:20 avron2 slapd[20939]: syncrepl_entry: rid=001 be_add 
uid=ingleber,ou=users,dc=msr-inria,dc=inria,dc=fr failed (19)
Jan 13 11:29:20 avron2 slapd[20939]: do_syncrepl: rid=001 rc 19 retrying
In node1 logs:
Jan 13 10:28:31 avron1 slapd[15713]: conn=1000 op=1 BIND 
dn="cn=syncrepl,ou=roles,dc=msr-inria,dc=inria,dc=fr" method=128
Jan 13 10:28:31 avron1 slapd[15713]: conn=1000 op=1 BIND 
dn="cn=syncrepl,ou=roles,dc=msr-inria,dc=inria,dc=fr" mech=SIMPLE ssf=0
Jan 13 10:28:31 avron1 slapd[15713]: conn=1000 op=1 RESULT tag=97 err=0 
text=
Jan 13 10:28:31 avron1 slapd[15713]: conn=1000 op=2 SRCH 
base="dc=msr-inria,dc=inria,dc=fr" scope=2 deref=0 filter="(objectClass=*)"
Jan 13 10:28:31 avron1 slapd[15713]: conn=1000 op=2 SRCH attr=* +
Jan 13 10:28:31 avron1 slapd[15713]: send_search_entry: conn 1000  ber 
write failed.
Jan 13 10:28:31 avron1 slapd[15713]: conn=1000 fd=21 closed (connection 
lost on write)
It's hard to tell if the failure occurs on the provider (ber write 
failed message) or consumer side (null_callback : error code 0x13).
Any hint welcome.
-- 
BOFH excuse #288:

Hard drive sleeping. Let it wake up on it's own...