Full_Name: Helge Milde Version: slapd 2.4.11 OS: Debian 5 URL: Submission from: (NULL) (81.175.2.164)
Hello,
In the last two days, slapd has randomly stopped working on our server, about one time every two or three hours. Before this, slapd had been running without any problems for about two months. slapd.conf has not been touched since then.
What happens is that slapd stops taking queries (postfix, dovecot, NSS etc. starts getting timeouts), and it starts using exactly 100% of one core until we restart it. After restart, everything works again (until this happens again).
It seems to only happen when slapd is under "stress" (we have an 8-core system with 8G RAM, and so slapd is really never under much "stress"; the biggest CPU load I've seen is about 40% on one core). The last ldap search before slapd stops working is always the same: Oct 2 10:57:38 oi-mail slapd[24550]: conn=1543 op=1 SRCH base="ou=groups,dc=web,dc=xxxx,dc=net" scope=1 deref=0 filter="(&(objectClass=posixGroup))" .. which should return about 1800 entries. Different connections might query this simultaneously, and it's a query that happens every ~10 seconds on average.
I've tried to reproduce this by doing .. # while [ 1 ]; do ldapsearch -x -H ldaps://.. -b 'ou=groups,dc=web,dc=xxxx,dc=net' '(&(objectClass=posixAccount))'; done .. for two minutes without any "luck" :). slapd just uses a bit more CPU, but never stops working.
With "loglevel stats stats2", I get nothing out of the logs: Oct 2 08:42:15 oi-mail slapd[15836]: conn=34798 op=1 ENTRY dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net" ... time of "crash" (we're mid-search at this point); nothing gets logged until we do a restart: ... Oct 2 08:42:30 oi-mail slapd[15836]: daemon: shutdown requested and initiated. Oct 2 08:42:30 oi-mail slapd[15836]: conn=34391 fd=31 closed (slapd shutdown) ... many entires like the one above, and then I seem to be getting the next entry from the same connection, right before it closes: ... Oct 2 08:42:30 oi-mail slapd[15836]: conn=34798 op=1 ENTRY dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net" Oct 2 08:42:30 oi-mail slapd[15836]: conn=34798 fd=81 closed (slapd shutdown)
I've tried logging more than just stats, but the interesting loglevels also causes a big performance hit (trace and args), and I'm not really sure which of the other loglevels could be helpful. This is a live ldap server, on a server that also hosts other services, so I simply can't take the performance hit.
I also did a strace -p <slapd pid>, which only gives "futex(0x41c499e0, FUTEX_WAIT," until I restart slapd, but the "restarting routine" could be useful, so here it is: http://www.pastie.org/639170
I understand that this probably isn't enough data to see what the problem is, and so I'm hoping to get some help in debugging this properly. As I've said, args/trace loglevel takes too much of a performance hit, but if this is the only way, we might reconcider.
Interesting parts of slapd.conf below. We use slapd.conf instead of cn=config, everything goes over TLS, we're AFAIK only using LDAPv3 binding, and we only have one database entry in slapd.conf.
moduleload back_hdb sizelimit 99999 tool-threads 8 threads 16 concurrency 32 backend hdb database hdb cachesize 60000 dbconfig set_cachesize 0 52428800 0 dbconfig set_lk_max_objects 1500 dbconfig set_lk_max_locks 1500 dbconfig set_lk_max_lockers 1500 lastmod on checkpoint 512 30
Best regards, Helge Milde