Hello list,
We are in the middle of a major upgrade of our OpenLDAP software, so it is a bit unfortunate that I have to track down issues at the same time.
os:  Solaris 10u8 x86
old: openldap-2.3.41  db-4.2.52.NC-PLUS_5_PATCHES
new: openldap-2.4.23  db-4.8.30.NC
We noticed that syncrepl stopped on pop01, pop03 and pop06 yesterday and fell behind. The only hints in slaplog were:
Sep 28 11:23:09 pop06.unix slapd[29027]: [ID 968320 local4.debug] do_syncrep2: LDAP_RES_INTERMEDIATE - NEW_COOKIE
Sep 28 11:24:44 pop06.unix slapd[29027]: [ID 763815 local4.debug] connection_input: conn=123099 deferring operation: too many executing
Sep 28 11:24:44 pop06.unix slapd[29027]: [ID 763815 local4.debug] connection_input: conn=123099 deferring operation: pending operations
Sep 28 11:24:48 pop06.unix last message repeated 72 times
and there were no more syncrepl messages until we restarted slapd, 2 hours later. I wonder if the syncrepl connection received "too many executing". Is that possible? Can we make it so sync connections get higher priority, as it were? In this case, it is new-ldap syncrepl to old-ldap for loopback lookups (dovecot).
Now, I would guess that getting "too many executing" is undesirable. Googling around, it seems that what happens is this: one connection already has more than half of the connection-pool operations executing, and further operations on it get deferred.
What does "one connection" mean? From one IP (all connections are over loopback, except for syncrepl), or is it operations from one-tcp-stream? Or it some other kind of cookie, like rid?
Can I get slapd to tell me which connection it actually means? Having looked at the sources, it does not seem to have that ability, but I could always add our own prints, at least to get the IP of the requester. (I tried "conns" in LogLevel, but it prints all select() calls, which is unfortunately unrealistic to run on live servers. Currently I have 'stats' running.)
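For reference, the check that produces these messages is in servers/slapd/connection.c. As far as I can tell the logic is roughly the following (paraphrased from my reading of 2.4.x, not verbatim source, so treat the names as approximate):

  /* sketch of the deferral decision in connection_input(),
   * paraphrased from servers/slapd/connection.c (2.4.x) */
  const char *defer = NULL;

  if ( conn->c_conn_state == SLAP_C_CLOSING ) {
      defer = "closing";
  } else if ( conn->c_writewaiter ) {
      defer = "awaiting write";
  } else if ( conn->c_n_ops_executing >= connection_pool_max / 2 ) {
      /* this single connection already holds at least half of the
       * thread pool ("threads" in slapd.conf, default 16) */
      defer = "too many executing";
  } else if ( conn->c_n_ops_pending ) {
      /* earlier operations on this same connection are still queued */
      defer = "pending operations";
  }

  if ( defer != NULL ) {
      Debug( LDAP_DEBUG_ANY,
          "connection_input: conn=%lu deferring operation: %s\n",
          conn->c_connid, defer, 0 );
      /* a local patch could also print conn->c_peer_name here
       * to record the requester's address */
  }

If that reading is right, "one connection" means one TCP stream (one conn=NNN in the logs) rather than one IP, and since the deferral message already carries the connection number, the earlier "conn=123099 ... ACCEPT from IP=..." line in the stats log should identify the requester without any source changes.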
Or rather than hacking at the sources, should I invest in getting the "monitor" backend to run? Would it show why we receive "too many executing"?
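(If it comes to that, and assuming slapd was built with back-monitor, which I believe is the default in 2.4, enabling it should only take something like this in slapd.conf; the admin DN below is just a placeholder for whatever identity actually exists on these hosts:

  database monitor
  access to dn.subtree="cn=Monitor"
      by dn.exact="cn=admin,o=example" read
      by * none

and then something along these lines should dump the per-connection entries, which if I remember right include counters for pending/executing operations and the peer address; the monitor attributes are operational, hence the '+':

  ldapsearch -x -H ldap://127.0.0.1 -D "cn=admin,o=example" -W \
      -b "cn=Connections,cn=Monitor" '(objectClass=*)' '+'

It would not literally say why an operation was deferred, but it should at least show which connection is piling up operations.)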
I have also noticed a considerable performance drop when moving from the old version to the new version, and I am not entirely sure whether that is something we can do anything about.
Following this email are the juicy parts of the slapd configuration on most of our slaves/loopback slapds.
--On Wednesday, September 29, 2010 11:33 AM +0900 Jorgen Lundman <lundman@lundman.net> wrote:
> Hello list,
> So we are in the middle of a major upgrade of our OpenLDAP software, so it is a bit unfortunate that I have to track down issues at the same time.
> os:  Solaris 10u8 x86
> old: openldap-2.3.41  db-4.2.52.NC-PLUS_5_PATCHES
> new: openldap-2.4.23  db-4.8.30.NC
> We noticed that syncrepl stopped on pop01, pop03 and pop06 yesterday and fell behind. The only hints in slaplog was:
Do you have different versions of OpenLDAP on the master vs the replicas? Or did you upgrade everything at once? How large is your database? How many entries does your database have? Are you using a disk cache or a memory cache for BDB?
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
> Do you have different versions of OpenLDAP on the master vs the replicas? Or did you upgrade everything at once? How large is your database? How many entries does your database have? Are you using a disk cache or a memory cache for BDB?
Unfortunately, yes. In this case, ldapmaster is 2.4.23, and so is ldapslave01, but pop06 has a loopback ldap of 2.3.41.
This is not supported and clearly undesirable, but with so many hosts it takes a long time to schedule maintenance windows and upgrade. Plus we admins need to sleep occasionally.
The database is:
-rw------- 1 root root 2.4G Sep 30 08:55 id2entry.bdb
with a total of about 7GB (including the BDB environment files, but no transaction files).
We already upgraded the machines to 8GB RAM following a previous conversation. DB stats report (ldapmaster):
4GB Total cache size
8 Number of caches
8 Maximum number of caches
512MB Pool individual cache size
0 Maximum memory-mapped file size
0 Maximum open file descriptors
0 Maximum sequential buffer writes
0 Sleep after writing maximum sequential buffers
0 Requested pages mapped into the process' address space
102M Requested pages found in the cache (99%)
204394 Requested pages not found in the cache
4282 Pages created in the cache
204394 Pages read into the cache
402689 Pages written from the cache to the backing file
0 Clean pages forced from the cache
0 Dirty pages forced from the cache
0 Dirty pages written by trickle-sync thread
208575 Current total page count
208559 Current clean page count
16 Current dirty page count
524296 Number of hash buckets used for page location
4096 Assumed page size used
102M Total number of times hash chains searched for a page (102254300)
18 The longest hash chain searched for a page
125M Total number of hash chain entries checked for page (125288450)
0 The number of hash bucket locks that required waiting (0%)
0 The maximum number of times any hash bucket lock was waited for (0%)
50 The number of region locks that required waiting (0%)
0 The number of buffers frozen
0 The number of buffers thawed
0 The number of frozen buffers freed
208752 The number of page allocations
and on pop loopback ldap:
80MB 2KB 912B Total cache size.
1 Number of caches.
80MB 8KB Pool individual cache size.
0 Requested pages mapped into the process' address space.
1439M Requested pages found in the cache (99%).
18M Requested pages not found in the cache.
1954 Pages created in the cache.
18M Pages read into the cache.
255279 Pages written from the cache to the backing file.
18M Clean pages forced from the cache.
40207 Dirty pages forced from the cache.
0 Dirty pages written by trickle-sync thread.
9069 Current total page count.
9055 Current clean page count.
14 Current dirty page count.
8191 Number of hash buckets used for page location.
1476M Total number of times hash chains searched for a page.
9 The longest hash chain searched for a page.
3383M Total number of hash buckets examined for page location.
2992M The number of hash bucket locks granted without waiting.
44448 The number of hash bucket locks granted after waiting.
3233 The maximum number of times any hash bucket lock was waited for.
63M The number of region locks granted without waiting.
76892 The number of region locks granted after waiting.
18M The number of page allocations.
36M The number of hash buckets examined during allocations
728 The max number of hash buckets examined for an allocation
18M The number of pages examined during allocations
360 The max number of pages examined for an allocation

(much smaller as it shares resources; it also only syncrepls the mail tree from LDAP)
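(For the record, both listings above look like db_stat's memory-pool report; the equivalent command should be roughly:

  db_stat -m -h /path/to/the/bdb/directory

pointed at whatever the database's "directory" directive names.)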
> Are you using a disk cache or a memory cache for BDB?
You can do that now? I'm afraid the only BDB-specific work I have done is the DB_CONFIG entries shown in the slapd.conf earlier. Repeated here for your convenience:
set_lk_detect DB_LOCK_DEFAULT
set_lg_max 52428800
set_cachesize 4 0 8
set_flags db_log_autoremove
set_lk_max_objects 1500
set_lk_max_locks 1500
set_lk_max_lockers 1500
Now, looking at "too many executing", we do need to do something about it. It happens on both old and new versions, so it could just be that we are at capacity already.
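(If the threshold really is half of the thread pool, as the sources above suggest, then one obvious knob would be enlarging the pool in slapd.conf, e.g.

  threads 32

up from the default of 16, which would double the point at which a single connection starts seeing "too many executing". Whether these boxes have the CPU to actually run that many threads is another question.)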
If I grep for the message on ldapmaster, we get 0 for all days. Of course, the master only syncrepls to the 4 slaves. The 4 slaves, in turn, syncrepl to all the loopback ldaps, but also get the occasional direct request.
# gzgrep "too many executing" /var/log/slaplog-201009$day.gz | wc -l
day   master  slave01  slave02  pop06
29    0       16       3689     66
28    0       21       2916     65
27    0       0        582      27
26    0       0        839      2
The difference between slave01 and slave02 here is the "idlcachesize 15000" line. We added it to slave01 last week, and I did the 6 am LiveUP to add it to slave02 today. Looks like that is a step in the right direction.
--On Thursday, September 30, 2010 9:22 AM +0900 Jorgen Lundman <lundman@lundman.net> wrote:
> The difference between slave01 and slave02 here is the "idlcachesize 15000" line. We added it to slave01 last week, and I did the 6 am LiveUP to add it to slave02 today. Looks like that is a step in the right direction.
Have you looked at the monitor backend to see how full your IDL cache is? Perhaps you may want to increase it further; knowing how full it is will help you decide.
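(For reference, with back-monitor enabled, the BDB cache counters should show up under the database's monitor entry, along these lines; the bind DN is a placeholder and the exact entry DN may differ per host:

  ldapsearch -x -H ldap://127.0.0.1 -D "cn=admin,o=example" -W \
      -b "cn=Databases,cn=Monitor" '(objectClass=*)' \
      olmBDBEntryCache olmBDBDNCache olmBDBIDLCache

Comparing olmBDBIDLCache against the configured idlcachesize gives an idea of how full it is.)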
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration