I am using openldap-2.4.11 with syncrepl n-way multimaster replication, and I am seeing some strange behavior when one of the masters runs out of disk space. I would expect this to be handled much the same way as a master being offline, but it behaves differently.
We are doing some failure testing on our new ldap infrastructure to catch problems before they happen and to improve our monitoring. One of the test cases I came up with is what happens on a master or slave if the disk the ldap database is stored on becomes full. Granted, we do monitor for this normally, but if it were somehow missed, or the disk filled up very fast, we would like to know now what will happen.
So here's the setup:
RHEL 5 i386 VMware guest
openldap-2.4.11 (custom RPM, all backends, all overlays, monitor as module, back ldap as module)
BDB backend
overlay accesslog
overlay ppolicy
overlay syncprov
overlay unique
overlay dynlist
overlay refint
overlay memberof
master1 and master2 replicate to one another with mirror mode
slave2 uses master2 as its replica provider (load balancing later)

master1 <-- mirror mode on --> master2 --> slave2 (updates chained to master2)
Scenario 1: master2 runs out of disk space, ldap modify request is issued against master2
Result: master2 performs the ldapmodify but master1 and slave2 are not notified

Scenario 1a: disk space is freed up on master2
Result: change is not replicated to master1 and slave2

Scenario 1b: disk space is freed up on master2 and master2 is restarted
Result: change is still not replicated to master1 and slave2

Scenario 1c: disk space is freed up on master2 and master1 or slave2 is restarted
Result: change is still not replicated to master1 and slave2

Scenario 2: master2 runs out of disk space, ldap modify request is issued against master1
Result: master1, master2 and slave2 are ALL updated
It gets even worse. Additional changes against multiple objects on master2 do not get propagated to master1 and slave2. The only thing that seems to bring master2 back into sync is to write, on master1, a change to the objects that were modified on master2 while the disk was full.
Scenario 3: master2 runs out of disk space, ldap modify request is issued against master2 for object1
Result: master2 performs modify but does not replicate it to master1 and slave2

Scenario 3a: disk space is freed on master2, ldap modify request is issued on master2 for object2
Result: object2 is modified but changes are not replicated to master1 and slave2

Scenario 3b: disk space is freed on master2, ldap modify is issued against master1 for object1
Result: master1 performs modify and replicates to master2, master2 replicates to slave2, changes to object1 on master2 are lost

Scenario 3c: disk space is freed on master2, ldap modify is issued against master1 for object2
Result: master1 performs modify and replicates to master2, master2 replicates to slave2, changes to object2 on master2 are lost
It's as if master2 is unable to replicate out any changes that originate on master2 until object1, which master2 committed while out of disk space, is modified on master1 and that modification is replicated back to master2.
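One way to check whether the servers really have diverged would be to compare the contextCSN values each one reports on the suffix entry. Below is a minimal Python sketch of that comparison. It assumes the 2.4-style CSN layout (timestamp#count#SID#mod, with the last three fields in hex) and assumes that master1 and master2 use server IDs 1 and 2. The function names and the sample values are made up for illustration; in practice the values would come from a base-scope search for contextCSN on each server.

```python
from typing import Dict, Iterable, List, NamedTuple


class CSN(NamedTuple):
    timestamp: str  # fixed-width, e.g. "20090130151200.000000Z"
    count: int
    sid: int
    mod: int


def parse_csn(value: str) -> CSN:
    """Split one contextCSN value into its four '#'-separated fields."""
    parts = value.strip().split("#")
    if len(parts) != 4:
        raise ValueError(f"unexpected CSN format: {value!r}")
    timestamp, count, sid, mod = parts
    # Assumption: count, SID and mod are hexadecimal, as in 2.4-style CSNs.
    return CSN(timestamp, int(count, 16), int(sid, 16), int(mod, 16))


def newest_per_sid(values: Iterable[str]) -> Dict[int, CSN]:
    """Keep only the newest CSN seen for each server ID."""
    newest: Dict[int, CSN] = {}
    for csn in map(parse_csn, values):
        if csn.sid not in newest or csn > newest[csn.sid]:
            newest[csn.sid] = csn
    return newest


def lagging_sids(local: Iterable[str], remote: Iterable[str]) -> List[int]:
    """Return the server IDs for which `local` is behind `remote`."""
    mine, theirs = newest_per_sid(local), newest_per_sid(remote)
    return sorted(
        sid for sid, csn in theirs.items()
        if sid not in mine or mine[sid] < csn
    )


if __name__ == "__main__":
    # Made-up sample values: master2 holds a newer change for SID 2.
    master1 = ["20090130151200.000000Z#000000#001#000000",
               "20090130151000.000000Z#000000#002#000000"]
    master2 = ["20090130151200.000000Z#000000#001#000000",
               "20090130153000.000000Z#000000#002#000000"]
    print(lagging_sids(master1, master2))
```

If master1 stays behind master2 for SID 2 after disk space is freed, that would match what Scenarios 1a through 1c show: the change exists on master2 but was never sent out.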
In the case of a slave running out of disk space, if it didn't fall into sync right away the solution would be to blow away the ldap database and let it do a full sync from the masters.
But in the case where one of the master servers runs out of disk space, what should be done to bring them back in sync without losing any changes?
openldap-software@openldap.org