Full_Name: Rodrigo Luiz Vargas Costa
Version: 2.4.17
OS: CentOS release 5.2 (Final)
URL: ftp://ftp.openldap.org/incoming/<TBD>
Submission from: (NULL) (135.245.8.5)
OpenLDAP developers,
I have been exchanging some information on the OpenLDAP lists, where it looks like some replication improvements are being made for release 2.4.18.
The architecture I'm running has two machines in MirrorMode on the same subnet (on the same switch). They form an HA system sharing a VIP: both machines run slapd simultaneously (bound to any local interface), and only the VIP is moved between them for HA purposes.
The issue I'm facing, from a general user's point of view, occurs when I stop the secondary, Provider2 (master 2), for backup purposes using slapcat. Provider1 (master 1) continues to provide LDAP service, so some entries can be created while the backup is running (Provider2's consumer is not connected during that time).
Even if only a small number of entries differ, when the consumer on Provider2 reconnects to Provider1, syncrepl falls back to a full DB search, as expected.
For reference, I have memory constraints that require me to limit dncachesize to around 80% of the DB entries.
From a user's perspective, once the cache is filled the system enters a state where synchronization no longer happens. For full reference (config, gdb, etc.), please see the file attached via FTP.
I then see two issues:
1) On Provider2's consumer, even after several days and with only a small number of differences (created for test purposes, with no traffic), syncrepl never completes and no replication takes place. Provider1 stays at 100% CPU the whole time.
2) Even if I stop Provider2 (and therefore its consumer), I see no change in Provider1's activity. The CPU remains at 100% even after days, which suggests a hung thread or a logic loop.
I compiled OpenLDAP with GDB symbols and took some traces of the threads while in the state described in issue 2 above. It looks like a thread stays looping forever, stuck on some lock.
I also noticed that, in this situation, the monitor shows the cache changing at a very slow pace, by a single entry. More specifically:
dn: cn=Database 1,cn=Databases,cn=Monitor
structuralObjectClass: monitoredObject
creatorsName:
modifiersName:
createTimestamp: 20090821145848Z
modifyTimestamp: 20090821145848Z
monitoredInfo: bdb
monitorIsShadow: TRUE
namingContexts: ou=CONTENT,o=domain,c=fr
readOnly: FALSE
monitorOverlay: syncprov
olmBDBEntryCache: 19920
olmBDBDNCache: 3896287
olmBDBIDLCache: 2
olmDbDirectory: /var/openldap-data/bdb1/
entryDN: cn=Database 1,cn=Databases,cn=Monitor
subschemaSubentry: cn=Subschema
hasSubordinates: TRUE
The olmBDBDNCache value keeps moving between 3896287 and 3896288. It looks like the memory reuse cycle is too short, causing locks that take a long time and, as a result, no synchronization.
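The comparison of monitor readings described above can be sketched in Python. This is only an illustration, not part of the attached traces. The helper names `parse_cache_counters` and `counter_deltas` are hypothetical, and the second snapshot is an assumed reconstruction that changes only the DN cache value to the second number reported above. The sketch parses the back-bdb cache counters out of two LDIF snapshots of the monitor entry and reports how much each one changed.

```python
# Cache counters exposed by the bdb database under cn=Monitor,
# as they appear in the entry quoted above.
CACHE_ATTRS = ("olmBDBEntryCache", "olmBDBDNCache", "olmBDBIDLCache")


def parse_cache_counters(ldif_text):
    """Return {attribute: int} for the cache counters found in one LDIF entry."""
    counters = {}
    for line in ldif_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in CACHE_ATTRS:
            counters[name.strip()] = int(value.strip())
    return counters


def counter_deltas(before, after):
    """Return the change of each counter present in both snapshots."""
    return {k: after[k] - before[k] for k in before if k in after}


# First snapshot: values taken from the monitor entry above (trimmed).
snapshot_1 = """\
dn: cn=Database 1,cn=Databases,cn=Monitor
olmBDBEntryCache: 19920
olmBDBDNCache: 3896287
olmBDBIDLCache: 2
"""

# Second snapshot: ASSUMED for illustration; only the DN cache value differs.
snapshot_2 = snapshot_1.replace("3896287", "3896288")

deltas = counter_deltas(parse_cache_counters(snapshot_1),
                        parse_cache_counters(snapshot_2))
print(deltas)  # {'olmBDBEntryCache': 0, 'olmBDBDNCache': 1, 'olmBDBIDLCache': 0}
```

Running such a comparison on snapshots taken some minutes apart would show whether the DN cache count is really only moving by one entry while the CPU stays at 100%.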
I took several GDB traces under different conditions. Please see the FTP attachment file for details.
Thanks,
Rodrigo.
PS: I could not upload the file to the OpenLDAP FTP server. It says the device is full. Please let me know how I can send this file.