Do you have different versions of OpenLDAP on the master vs the replicas? Or did you upgrade everything at once? How large is your database? How many entries does your database have? Are you using a disk cache or a memory cache for BDB?
Unfortunately, yes. In this case, ldapmaster is 2.4.23 and so is ldapslave01, but pop06 has a loopback ldap of 2.3.41.
This is not supported and clearly undesirable, but with so many hosts it takes a long time to schedule maintenance windows and upgrade everything. Plus, we admins need to sleep occasionally.
The database is:
-rw------- 1 root root 2.4G Sep 30 08:55 id2entry.bdb
for a total of about 7GB (including the BDB environment files, but no transaction log files).
Following the earlier conversation, we already upgraded the machines to 8GB RAM. DB stats report (ldapmaster):
4GB     Total cache size
8       Number of caches
8       Maximum number of caches
512MB   Pool individual cache size
0       Maximum memory-mapped file size
0       Maximum open file descriptors
0       Maximum sequential buffer writes
0       Sleep after writing maximum sequential buffers
0       Requested pages mapped into the process' address space
102M    Requested pages found in the cache (99%)
204394  Requested pages not found in the cache
4282    Pages created in the cache
204394  Pages read into the cache
402689  Pages written from the cache to the backing file
0       Clean pages forced from the cache
0       Dirty pages forced from the cache
0       Dirty pages written by trickle-sync thread
208575  Current total page count
208559  Current clean page count
16      Current dirty page count
524296  Number of hash buckets used for page location
4096    Assumed page size used
102M    Total number of times hash chains searched for a page (102254300)
18      The longest hash chain searched for a page
125M    Total number of hash chain entries checked for page (125288450)
0       The number of hash bucket locks that required waiting (0%)
0       The maximum number of times any hash bucket lock was waited for (0%)
50      The number of region locks that required waiting (0%)
0       The number of buffers frozen
0       The number of buffers thawed
0       The number of frozen buffers freed
208752  The number of page allocations
and on pop loopback ldap:
80MB 2KB 912B   Total cache size.
1       Number of caches.
80MB 8KB        Pool individual cache size.
0       Requested pages mapped into the process' address space.
1439M   Requested pages found in the cache (99%).
18M     Requested pages not found in the cache.
1954    Pages created in the cache.
18M     Pages read into the cache.
255279  Pages written from the cache to the backing file.
18M     Clean pages forced from the cache.
40207   Dirty pages forced from the cache.
0       Dirty pages written by trickle-sync thread.
9069    Current total page count.
9055    Current clean page count.
14      Current dirty page count.
8191    Number of hash buckets used for page location.
1476M   Total number of times hash chains searched for a page.
9       The longest hash chain searched for a page.
3383M   Total number of hash buckets examined for page location.
2992M   The number of hash bucket locks granted without waiting.
44448   The number of hash bucket locks granted after waiting.
3233    The maximum number of times any hash bucket lock was waited for.
63M     The number of region locks granted without waiting.
76892   The number of region locks granted after waiting.
18M     The number of page allocations.
36M     The number of hash buckets examined during allocations
728     The max number of hash buckets examined for an allocation
18M     The number of pages examined during allocations
360     The max number of pages examined for an allocation
(Much smaller, as it shares resources; it also only syncrepls the mail tree from LDAP.)
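For reference, these look like Berkeley DB memory-pool statistics, so something along the following lines should reproduce them (the environment directory is an assumption; point -h at wherever the DB_CONFIG file lives):

# memory pool (cache) statistics for the BDB environment behind slapd;
# /var/lib/ldap is an assumed path, used here only for illustration
db_stat -m -h /var/lib/ldap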
Are you using a disk cache or a memory cache for BDB?
You can do that now? I'm afraid the only BDB-specific work I have done is the DB_CONFIG entries shown alongside the slapd.conf earlier. Repeated here for your convenience:
set_lk_detect DB_LOCK_DEFAULT
set_lg_max 52428800
set_cachesize 4 0 8
set_flags db_log_autoremove
set_lk_max_objects 1500
set_lk_max_locks 1500
set_lk_max_lockers 1500
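For what it's worth, my reading of the cache and log lines (the comments are my interpretation, worth double-checking against the Berkeley DB reference, not something taken from our configs):

set_cachesize 4 0 8    # gigabytes, bytes, segments: a 4GB cache split into 8 caches of ~512MB,
                       # which matches the "4GB / 8 caches / 512MB per cache" figures above
set_lg_max 52428800    # cap each transaction log file at 50MB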
Now, looking at "too many executing", we do need to do something about it. It happens on both old and new versions, so it could just be that we are at capacity already.
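If that is the usual "deferring operation: too many executing" message, then as far as I understand it is tied to slapd's worker thread pool: once too many operations are already executing, new ones get deferred and the message is logged. The pool size is set with the global threads directive in slapd.conf; the value below is purely illustrative, not a recommendation:

# global section of slapd.conf -- the default pool size is 16
threads 32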
If I grep for the message on ldapmaster, we get 0 for all days. Of course, the master only syncrepls to the 4 slaves. The 4 slaves, in turn, syncrepl to all the loopback ldaps, but also get the occasional direct request.
# gzgrep "too many executing" /var/log/slaplog-201009$day.gz | wc -l
day  master  slave01  slave02  pop06
29   0       16       3689     66
28   0       21       2916     65
27   0       0        582      27
26   0       0        839      2
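(A loop along these lines, run on each host, reproduces the per-day counts; the log path and date range simply follow the command above:

for day in 26 27 28 29; do
    printf "%s " "$day"
    gzgrep "too many executing" /var/log/slaplog-201009$day.gz | wc -l
done
)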
The difference between slave01 and slave02 here is the "idlcachesize 15000" line. We added it to slave01 last week, and I did the 6 am LiveUP to add it to slave02 today. Looks like that is a step in the right direction.
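For anyone following along, the directive sits in the bdb database section of slapd.conf, roughly like this (the suffix and the entry-cache value are placeholders; only the idlcachesize 15000 line is the actual change):

database     bdb
suffix       "dc=example,dc=com"   # placeholder suffix
cachesize    10000                 # entry cache, illustrative value only
idlcachesize 15000                 # the IDL (index slot) cache line added to slave01/slave02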