OpenLDAP software community,
I would like to report an issue I'm having. Since I'm not sure whether it is a real problem, I would appreciate help or comments on this behavior.
I configured slapd.conf for a single provider (master) and a single consumer (slave). The slapd.conf for each node is attached for your comments.
These files were prepared in accordance with the OpenLDAP 2.4 Administrator's Guide, and I do not believe the issue is related to configuration.
I'm using BDB 4.7 with all available patches and OpenLDAP 2.4.15 HEAD, since I'm also testing the resolution of ITS#5860, which I posted some time ago.
So I should have one of the latest releases. My system is running Linux kernel 2.6, more specifically:
Linux brtldp12 2.6.18-92.1.22.el5PAE #1 SMP Tue Dec 16 12:36:25 EST 2008 i686 i686 i386 GNU/Linux
The database on this system holds around 4 million entries.
The behavior is as follows:
1) Start slapd on the provider (master) node with the command: date; /usr/libexec/slapd -d 256 -h "ldap://10.142.15.41:389 ldap://10.142.15.171:389" -u ldap
2) Start slapd on the consumer (slave) node with the command: date; /usr/libexec/slapd -d 256 -h "ldap://10.142.15.42:389 ldap://10.142.15.172:389" -u ldap
3) Since there are 2 DBs, CONTENT and INDEX, the provider log shows the consumer starting a search on each of them (used to verify synchronization):
conn=0 fd=16 ACCEPT from IP=10.142.15.42:52117 (IP=10.142.15.41:389)
conn=1 fd=17 ACCEPT from IP=10.142.15.42:52118 (IP=10.142.15.41:389)
conn=0 op=0 BIND dn="cn=admin,ou=content,o=alcatel,c=fr" method=128
conn=1 op=0 BIND dn="cn=admin,ou=indexes,o=alcatel,c=fr" method=128
conn=0 op=0 BIND dn="cn=admin,ou=content,o=alcatel,c=fr" mech=SIMPLE ssf=0
conn=1 op=0 BIND dn="cn=admin,ou=indexes,o=alcatel,c=fr" mech=SIMPLE ssf=0
conn=0 op=0 RESULT tag=97 err=0 text=
conn=1 op=0 RESULT tag=97 err=0 text=
conn=0 op=1 SRCH base="ou=content,o=alcatel,c=fr" scope=2 deref=0 filter="(objectClass=*)"
conn=0 op=1 SRCH attr=* +
conn=1 op=1 SRCH base="ou=indexes,o=alcatel,c=fr" scope=2 deref=0 filter="(objectClass=*)"
conn=1 op=1 SRCH attr=* +
4) Use the monitor interface to check cache usage. The configured limits are respected on the provider. An ldapsearch takes around 28 minutes to complete on each DB.
5) Monitor the CPU and memory usage of the slapd process on the provider (master) with top. I see many sync_p entries for the provider threads, and CPU usage by the slapd process is always low, normally between 11% and 108%.
6) Monitor the CPU and memory usage of the slapd process on the consumer (slave) with top. I see almost no sync_p entries and much higher CPU usage, since there are 2 searches running against the provider. CPU usage is normally around 48% to 200%.
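To check from the provider log whether the two syncrepl searches shown in step 3 have actually completed, something like the following could be used. It is only a sketch: pending_searches is a hypothetical helper, and it assumes that slapd writes a "SEARCH RESULT" line for the same conn/op at this log level when a search finishes.

```python
import re

# Search-related lines from the provider log excerpt in step 3.
LOG = """\
conn=0 op=1 SRCH base="ou=content,o=alcatel,c=fr" scope=2 deref=0 filter="(objectClass=*)"
conn=0 op=1 SRCH attr=* +
conn=1 op=1 SRCH base="ou=indexes,o=alcatel,c=fr" scope=2 deref=0 filter="(objectClass=*)"
conn=1 op=1 SRCH attr=* +
"""

# Matches the start of a search (SRCH) or its completion (SEARCH RESULT).
LINE_RE = re.compile(r"conn=(\d+) op=(\d+) (SRCH|SEARCH RESULT)\b")


def pending_searches(log_text):
    """Return sorted (conn, op) pairs that have a SRCH line but no SEARCH RESULT."""
    started, finished = set(), set()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        key = (int(match.group(1)), int(match.group(2)))
        if match.group(3) == "SRCH":
            started.add(key)
        else:
            finished.add(key)
    return sorted(started - finished)


print(pending_searches(LOG))
```

With only the lines quoted in step 3, both searches are reported as still open; once the log contains a matching SEARCH RESULT line for a conn/op pair, that pair drops out of the list.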
The problem appears after some time has passed: slapd on the consumer stays at an almost constant 200% CPU utilization, while the slapd process on the provider becomes idle (0% CPU used by slapd).
This appears to indicate that the search has ended but that something on the consumer is not working properly, since it is using so much CPU.
Another strange thing is that if I run an ldapsearch against the consumer (slave), it responds very slowly, indicating something is really wrong. I started an ldapsearch and, after some hours, killed the slapd process; only around 50 thousand entries had been returned. This is really slow performance.
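To put rough numbers on the difference, here is a small sketch of the arithmetic. It assumes the 28-minute ldapsearch on the provider returns on the order of the full 4 million entries, which may not be exact per DB. The elapsed time on the consumer is known only as "some hours", so the hour values below are purely illustrative, and entries_per_second is a hypothetical helper.

```python
def entries_per_second(entries, seconds):
    """Average throughput of a search that returned `entries` in `seconds`."""
    return entries / seconds


# Provider: assumed ~4 million entries in about 28 minutes.
provider_rate = entries_per_second(4_000_000, 28 * 60)
print(f"provider: about {provider_rate:.0f} entries/s")

# Consumer: ~50 thousand entries; elapsed hours are illustrative assumptions.
for hours in (1, 2, 4):
    rate = entries_per_second(50_000, hours * 3600)
    print(f"consumer after {hours} h: about {rate:.1f} entries/s "
          f"({provider_rate / rate:.0f}x slower than the provider)")
```

Even under the most favorable assumption of a single hour, the consumer would be more than two orders of magnitude slower than the provider.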
It also looks like the consumer (slave) process never finishes the search/sync: the 2 searches (one per DB) each end up consuming 100% of a CPU, for 200% total CPU utilization, even after many hours have passed.
Another behavior I was not expecting is that, even though the search goes from the consumer to the provider, the cache counters shown by the monitor interface keep growing on the consumer. On the consumer I see something like:
dn: cn=Database 2,cn=Databases,cn=Monitor
structuralObjectClass: monitoredObject
creatorsName:
modifiersName:
createTimestamp: 20090310002542Z
modifyTimestamp: 20090310002542Z
monitoredInfo: bdb
monitorIsShadow: TRUE
namingContexts: ou=INDEXES,o=domain,c=fr
readOnly: FALSE
monitorUpdateRef: ldap://10.142.15.41:389
olmBDBEntryCache: 9999
olmBDBDNCache: 1309605
olmBDBIDLCache: 4715
olmDbDirectory: /var/openldap-data/bdb2/
I did not run any manual search against the consumer DB, so this must be caused by syncrepl. I'm not sure why, since I wasn't expecting it. The cache information can be seen in the attached files.
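For anyone wanting to track these counters over time, a minimal sketch of extracting them from a monitor entry saved as LDIF text is shown here. The helper names parse_cache_counters and growth are hypothetical, the sample is a shortened copy of the entry above, and the parser is deliberately naive (it does not handle folded or base64-encoded LDIF lines).

```python
# Shortened copy of the cn=Monitor entry seen on the consumer.
SAMPLE = """\
dn: cn=Database 2,cn=Databases,cn=Monitor
monitoredInfo: bdb
namingContexts: ou=INDEXES,o=domain,c=fr
olmBDBEntryCache: 9999
olmBDBDNCache: 1309605
olmBDBIDLCache: 4715
"""

CACHE_ATTRS = ("olmBDBEntryCache", "olmBDBDNCache", "olmBDBIDLCache")


def parse_cache_counters(ldif_text):
    """Return {attribute: int} for the cache counters found in one LDIF entry."""
    counters = {}
    for line in ldif_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in CACHE_ATTRS:
            counters[name.strip()] = int(value.strip())
    return counters


def growth(before, after):
    """Difference between two counter snapshots, for attributes present in both."""
    return {name: after[name] - before[name] for name in after if name in before}


print(parse_cache_counters(SAMPLE))
```

Taking two snapshots some minutes apart and passing them to growth would show which cache is growing while syncrepl runs.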
Please let me know whether you think this is a real problem or whether some configuration change could resolve this behavior.
Thanks,
Rodrigo Costa.