Syncrepl behavior when disk is full - openldap-software

20 Apr 2009


I am using openldap-2.4.11 with syncrepl n-way multimaster replication and I
am seeing some strange behavior if one of the masters runs out of disk
space.  I would expect this to be handled much the same way as a master being
offline, but it behaves differently.
We are doing some failure testing on our new ldap infrastructure to prevent
problems before they happen and to be able to improve our monitoring.  One of
the test cases I came up with is what happens on a master or slave if the
disk the ldap database is stored on becomes full.  Granted, we do monitor
for this normally, but if it were somehow missed or the disk filled up very
fast, we would like to know now what will happen.
So here's the setup:
RHEL 5 i386
VMWare Guest
openldap-2.4.11 (custom RPM, all backends, all overlays, monitor as module,
back ldap as module)
BDB backend
overlay accesslog
overlay ppolicy
overlay syncprov
overlay unique
overlay dynlist
overlay refint
overlay memberof
master1 and master2 replicate to one another with mirror mode
slave2 uses master2 as its replica provider (load balancing later)
master1 <-- mirror mode on --> master2 --> slave2 (updates chained to
master2)
Scenario 1:
  master2 runs out of disk space, ldap modify request is issued against
master2
Result:
  master2 performs the ldapmodify but master1 and slave2 are not notified
Scenario 1a:
  disk space is freed up on master2
Result:
  change is not replicated to master1 and slave2
Scenario 1b:
  disk space is freed up on master2 and master2 is restarted
Result:
  change is still not replicated to master1 and slave2
Scenario 1c:
  disk space is freed up on master2 and master1 or slave2 restarted
Result:
  change is still not replicated to master1 and slave2
Scenario 2:
  master2 runs out of disk space, ldap modify request is issued against
master1
Result:
  master1, master2 and slave2 are ALL updated
It gets worse.  Additional changes against multiple objects on
master2 do not get propagated to master1 and slave2.  The only thing that
seems to bring master2 back into sync is to write a change, on master1, to
any of the objects modified during the time when the disk was full.
Scenario 3:
  master2 runs out of disk space, ldap modify request is issued against
master2 for object1
Result:
  master2 performs modify but does not replicate it to master1 and slave2
Scenario 3a:
  disk space is freed on master2, ldap modify request is issued on master2
for object2
Result:
  object2 is modified but changes are not replicated to master1 and slave2
Scenario 3b:
  disk space is freed on master2, ldap modify is issued against master1 for
object1
Result:
  master1 performs modify and replicates to master2, master2 replicates to
slave2, changes to object1 on master2 are lost
Scenario 3c:
  disk space is freed on master2, ldap modify is issued against master1 for
object2
Result:
  master1 performs modify and replicates to master2, master2 replicates to
slave2, changes to object2 on master2 are lost
It's as if master2 is unable to replicate out any changes that originate on
master2 until object1, which was modified on master2 while it was out of disk
space, is modified on master1 and that change is replicated to master2.
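For anyone trying to reproduce this, one way to see which servers are behind
is to compare the contextCSN values on the suffix entry of each server.  Below
is a minimal sketch in Python (not part of our setup).  It assumes the 2.4 CSN
layout time#count#sid#mod with hex count, sid and mod fields, and that the
values were already fetched from each server by some other means.  The helper
names (parse_csn, newest_per_sid, find_lagging) and the sample CSN strings are
made up for illustration.

```python
from typing import Dict, List, NamedTuple


class CSN(NamedTuple):
    timestamp: str  # e.g. "20090420123000.000000Z"; fixed width, so it sorts as text
    count: int
    sid: int
    mod: int


def parse_csn(value: str) -> CSN:
    """Split a CSN string of the assumed form time#count#sid#mod."""
    ts, count, sid, mod = value.strip().split("#")
    return CSN(ts, int(count, 16), int(sid, 16), int(mod, 16))


def newest_per_sid(values: List[str]) -> Dict[int, CSN]:
    """Keep only the newest CSN seen for each server ID."""
    newest: Dict[int, CSN] = {}
    for value in values:
        csn = parse_csn(value)
        if csn.sid not in newest or csn > newest[csn.sid]:
            newest[csn.sid] = csn
    return newest


def find_lagging(servers: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Map each server to the SIDs for which another server holds a newer CSN."""
    per_server = {name: newest_per_sid(vals) for name, vals in servers.items()}

    # Newest CSN known anywhere, per SID.
    best: Dict[int, CSN] = {}
    for csns in per_server.values():
        for sid, csn in csns.items():
            if sid not in best or csn > best[sid]:
                best[sid] = csn

    lagging: Dict[str, List[int]] = {}
    for name, csns in per_server.items():
        behind = sorted(
            sid for sid, csn in best.items()
            if sid not in csns or csns[sid] < csn
        )
        if behind:
            lagging[name] = behind
    return lagging


if __name__ == "__main__":
    # Made-up sample values: master2 (SID 2) has a write the others never saw.
    sample = {
        "master1": ["20090420120000.000000Z#000000#001#000000",
                    "20090420110000.000000Z#000000#002#000000"],
        "master2": ["20090420120000.000000Z#000000#001#000000",
                    "20090420123000.000000Z#000000#002#000000"],
        "slave2":  ["20090420120000.000000Z#000000#001#000000",
                    "20090420110000.000000Z#000000#002#000000"],
    }
    print(find_lagging(sample))
```

With values like the sample ones, master1 and slave2 would be reported as
behind for SID 2, which matches what I see in scenarios 1 and 3.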
In the case of a slave running out of disk space, if it didn't fall into
sync right away the solution would be to blow away the ldap database and let
it do a full sync from the masters.
But in the case where one of the master servers runs out of disk space, what
should be done to bring them back in sync without losing any changes?