OpenLDAP group,
I may have run into a real problem. I have a DB with around 4 million entries. In my slapd.conf I use the following cache settings:
    # Cache values
    cachesize    10000
    dncachesize  3000000
    idlcachesize 10000
    cachefree    10
I'm also running the system on two machines in MirrorMode, with no problems related to this configuration.
My DB has exactly 3,882,992 entries, so the dncachesize is smaller than the number of records. I set the dncache limit below the number of records because of memory concerns, and since then I have seen problems with, for example, ldapsearch.
Once the number of entries in the cache reaches the limit (it always overshoots a little), the system hangs on new searches. It appears to happen for records not yet cached: if an uncached record is searched, ldapsearch binds successfully but then hangs during the search. One example can be seen below:
    [root@brtldp11 ~]# time ldapsearch -LLL -x -D "cn=admin,ou=CONTENT,o=domain,c=fr" -w secret -b "ou=CONTENT,o=domain,c=fr" -H ldap://10.142.15.170:389 'pnnumber=+554184011071'

    real    0m40.140s
    user    0m0.003s
    sys     0m0.001s
The command only stopped because I pressed CTRL-C; it would otherwise stay in this state forever. This happens after I ldapsearch the full DB and the cache is filled.
If I then search for this same record on the mirror, the result comes back very fast. See the example:
    [root@brtldp11 ~]# time ldapsearch -LLL -x -D "cn=admin,ou=CONTENT,o=domain,c=fr" -w secret -b "ou=CONTENT,o=domain,c=fr" -H ldap://10.142.15.172:389 'pnnumber=+554184011071'
    dn: pnnumber=\2B554184011071,uid=1219843774965\2B554184011071,ou=REPOSITORY,ou=CONTENT,o=domain,c=fr
    subpnid: 0
    pntype: 2
    pncaps: 7
    objectClass: phoneinfo
    pnnumber: +554184011071
    real    0m0.257s
    user    0m0.002s
    sys     0m0.003s
So this does not appear to be a problem with the record itself, since both systems are in MirrorMode. The result also comes back in a reasonable time, around 257 milliseconds; once cached it can be even faster.
See the number of entries I could search before the system hangs:
    [root@brtldp12 ~]# wc -l /backup/temp2.txt
    3078804 /backup/temp2.txt
The count was 3,078,804 records: even though the dncache boundary is 3,000,000, it always overshoots a little.
It appears that if a cache limit is smaller than the number of DB records, slapd can hang when searching for records not yet cached. Since I have multiple DBs with this order of records, I had to set this limit below the number of records. I expected some performance degradation after the cache filled, but not that the system could hang (at least for non-cached records).
Once this situation occurs, the slapd process no longer shuts down cleanly; it only exits after a kill -9.
I would like to know if someone has already run into this situation. I believe it can easily be reproduced with a cache configuration smaller than the number of records, and that any DB in this situation would show the same behavior.
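Something like the following should reproduce it (an untested sketch; the suffix, LDIF file, and paths are placeholders for whatever your test DB uses):

    # slapd.conf: back-bdb/hdb caches deliberately smaller than the entry count
    cachesize    10000
    dncachesize  100000
    idlcachesize 10000

    # with slapd stopped, load more entries than dncachesize allows
    slapadd -f slapd.conf -l big.ldif

    # start slapd, then walk the whole DB once to fill the caches ...
    ldapsearch -LLL -x -b "o=domain,c=fr" -H ldap://localhost:389 '(objectClass=*)' dn > /dev/null
    # ... then search for an entry that should have been evicted and watch it hang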
Any comments on whether this could be a configuration issue or something else? Would this be worth an ITS?
Thanks,
Rodrigo.
Rodrigo Costa wrote:
> OpenLDAP group,
>
> I may have run into a real problem. I have a DB with around 4 million entries. In my slapd.conf I use the following cache settings:
> [...]
> Any comments on whether this could be a configuration issue or something else? Would this be worth an ITS?
A number of dncache issues have been fixed already in CVS.
On Monday 15 June 2009 04:44:21 Rodrigo Costa wrote:
> OpenLDAP group,
>
> I may have run into a real problem. I have a DB with around 4 million entries. In my slapd.conf I use the following cache settings:
>
>     # Cache values
>     cachesize    10000
>     dncachesize  3000000
>     idlcachesize 10000
>     cachefree    10
Your idlcachesize might be too small for your cachesize (but you don't say which backend you are using). You also haven't listed any Berkeley DB tuning.
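For reference, a minimal DB_CONFIG sketch for a database of this size (the numbers are illustrative assumptions, not recommendations; size the BDB cache to your actual id2entry and index files):

    # DB_CONFIG in the database directory
    # 2 GB BDB cache in one segment (gbytes bytes ncache)
    set_cachesize 2 0 1
    # larger log buffer and log region than the defaults
    set_lg_bsize 2097152
    set_lg_regionmax 262144

Note that DB_CONFIG is only read when the BDB environment is created, so after changing set_cachesize you typically need to recreate the environment (e.g. run db_recover with slapd stopped).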
> I'm also running the system on two machines in MirrorMode, with no problems related to this configuration.
>
> My DB has exactly 3,882,992 entries, so the dncachesize is smaller than the number of records. I set the dncache limit below the number of records because of memory concerns, and since then I have seen problems with, for example, ldapsearch.
>
> Once the number of entries in the cache reaches the limit (it always overshoots a little), the system hangs on new searches. It appears to happen for records not yet cached: if an uncached record is searched, ldapsearch binds successfully but then hangs during the search. One example can be seen below:
>
>     [root@brtldp11 ~]# time ldapsearch -LLL -x -D "cn=admin,ou=CONTENT,o=domain,c=fr" -w secret -b "ou=CONTENT,o=domain,c=fr" -H ldap://10.142.15.170:389 'pnnumber=+554184011071'
>
>     real    0m40.140s
>     user    0m0.003s
>     sys     0m0.001s
>
> The command only stopped because I pressed CTRL-C; it would otherwise stay in this state forever. This happens after I ldapsearch the full DB and the cache is filled.
But not every entry in the DB is cached; slapd needs to access the remaining entries by some means to determine whether they match the filter.
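If the filter attribute is indexed, slapd can find the candidates without touching every entry. A sketch, assuming a bdb/hdb database and your pnnumber attribute (run slapindex afterwards, with slapd stopped, so existing entries get indexed):

    # slapd.conf, inside the database section
    index pnnumber eq

Without an equality index on pnnumber, a filter like 'pnnumber=+554184011071' forces slapd to examine a very large candidate set instead of the single matching entry.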
> If I then search for this same record on the mirror, the result comes back very fast. See the example:
>
>     [root@brtldp11 ~]# time ldapsearch -LLL -x -D "cn=admin,ou=CONTENT,o=domain,c=fr" -w secret -b "ou=CONTENT,o=domain,c=fr" -H ldap://10.142.15.172:389 'pnnumber=+554184011071'
>     dn: pnnumber=\2B554184011071,uid=1219843774965\2B554184011071,ou=REPOSITORY,ou=CONTENT,o=domain,c=fr
>     subpnid: 0
>     pntype: 2
>     pncaps: 7
>     objectClass: phoneinfo
>     pnnumber: +554184011071
>
>     real    0m0.257s
>     user    0m0.002s
>     sys     0m0.003s
Most likely because the OS filesystem cache has the index cached in memory. You may want to consider filling the OS filesystem cache (e.g. by copying files larger than the total system memory) to check ...
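A rough way to do that (the path is a placeholder; reading anything larger than RAM through the page cache will evict the index pages):

    # push other data through the OS page cache to evict the index
    dd if=/some/big/file of=/dev/null bs=1M
    # then re-run the timed search on the mirror and compare
    time ldapsearch -LLL -x -D "cn=admin,ou=CONTENT,o=domain,c=fr" -w secret \
        -b "ou=CONTENT,o=domain,c=fr" -H ldap://10.142.15.172:389 'pnnumber=+554184011071'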
> So this does not appear to be a problem with the record itself, since both systems are in MirrorMode. The result also comes back in a reasonable time, around 257 milliseconds; once cached it can be even faster.
What is cached? Cached in which cache? Are you sure? How did you check?
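One way to check the BDB side (assuming a bdb/hdb backend, a database directory of /var/lib/ldap, and db_stat from the same Berkeley DB version slapd links against):

    # memory pool (cache) statistics for the DB environment
    db_stat -h /var/lib/ldap -m

The counters for pages found vs. not found in the cache show whether the BDB cache is actually absorbing the reads or whether every lookup goes to disk.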
> See the number of entries I could search before the system hangs:
>
>     [root@brtldp12 ~]# wc -l /backup/temp2.txt
>     3078804 /backup/temp2.txt
How did you populate this file? With what search (specifically, what filter)?
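It matters because, for example, a full dump along the lines of (a guess at what you ran, not a quote):

    ldapsearch -LLL -x -D "cn=admin,ou=CONTENT,o=domain,c=fr" -w secret \
        -b "ou=CONTENT,o=domain,c=fr" -H ldap://10.142.15.170:389 \
        '(objectClass=*)' dn | grep '^dn:' > /backup/temp2.txt

walks every entry and fills the entry/DN caches in a completely different pattern from a series of indexed equality searches.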
> The count was 3,078,804 records: even though the dncache boundary is 3,000,000, it always overshoots a little.
>
> It appears that if a cache limit is smaller than the number of DB records, slapd can hang when searching for records not yet cached. Since I have multiple DBs with this order of records, I had to set this limit below the number of records. I expected some performance degradation after the cache filled, but not that the system could hang (at least for non-cached records).
You may be thrashing the Berkeley DB cache if your indexes are too big for the cachesize you have set in DB_CONFIG (and if you haven't set one, the default of 256k is definitely way too small for 3 million records).
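A quick sanity check is to compare the on-disk database and index sizes against the configured cache (the path is an assumption; use your actual database directory):

    ls -lh /var/lib/ldap/*.bdb

If dn2id.bdb, id2entry.bdb, and the attribute index files add up to gigabytes against a 256k cache, nearly every uncached lookup turns into disk I/O.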
Also, the fact that you only seem to be searching on the naming attribute may be giving you a false idea of the real performance of your directory within the dncachesize limits (I am not sure whether slapd can answer searches on the naming attribute faster from the DN cache than searches on a non-naming attribute ...).
> Once this situation occurs, the slapd process no longer shuts down cleanly; it only exits after a kill -9.
>
> I would like to know if someone has already run into this situation. I believe it can easily be reproduced with a cache configuration smaller than the number of records, and that any DB in this situation would show the same behavior.
Anyone running an OpenLDAP directory on the bdb or hdb backend with more than 10,000 entries, without knowing what their Berkeley DB configuration looks like in terms of DB cache, is asking for problems.
Regards,
Buchan