We run four OpenLDAP 2.4.16 servers as two provider/consumer pairs: one pair for our staff systems and one pair for our teaching facilities. They are all on Solaris 10u7 Xen virtual hosts.
The staff pair runs fine.
The consumer on the teaching pair runs fine. The provider on the teaching pair also runs fine until it gets hit by a heavy load, e.g. the start of a lab when ~100 PCs try to authenticate their users. At that point it refuses to serve LDAP requests; traffic is still coming in to the box and existing connections seem OK. The break point is about 35 PCs; below that there isn't a problem. Restarting slapd cures the problem and off we go until the start of the next big lab.
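If it helps to have a way of approximating the load, a crude parallel ldapsearch loop along the following lines is roughly what a lab start looks like from the server's side (the URI, bind DN, password and filter are placeholders rather than our real values):

#!/bin/sh
# Fire off ~100 simultaneous binds/searches, roughly one per lab PC.
# The URI, bind DN, password and filter below are placeholders.
i=0
while [ $i -lt 100 ]; do
    ldapsearch -x -H ldap://teaching-provider.my.domain \
        -D "cn=fred,ou=Profile,dc=my,dc=domain" -w secret \
        -b "dc=my,dc=domain" "(uid=user$i)" >/dev/null 2>&1 &
    i=`expr $i + 1`
done
wait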
I've run at various log levels but haven't been able to see any obvious messages. All I see, even when everything is fine, are messages of the form:
send_search_entry: conn 11639 ber write failed.
connection_read(38): no connection!
The slapd.conf (minus the syncprov bit) is:
include /usr/local/etc/openldap/schema/core.schema
include /usr/local/etc/openldap/schema/cosine.schema
include /usr/local/etc/openldap/schema/inetorgperson.schema
include /usr/local/etc/openldap/schema/nis.schema
include /usr/local/etc/openldap/schema/duaconf.schema
include /usr/local/etc/openldap/schema/local.schema

pidfile /var/openldap/run/slapd.pid
argsfile /var/openldap/run/slapd.args

conn_max_pending 200
idletimeout 60
sizelimit 2000
loglevel 256
database bdb
suffix "dc=my,dc=domain"
rootdn "cn=me,dc=my,dc=domain"
rootpw {SSHA}guess
directory /var/openldap/openldap-data

index cn,entryCSN,entryUUID,gidNumber,ipHostNumber,memberUid eq
index objectclass,uid,uidNumber,uniqueMember eq

cachefree 16
cachesize 1500
checkpoint 0 60
dncachesize 1500
idlcachesize 3000

access to attrs=userPassword
        by self write
        by anonymous auth
        by dn.base="cn=fred,ou=Profile,dc=my,dc=domain" read
        by * none
access to *
        by self write
        by users read
        by * read
The only entry in DB_CONFIG is set_cachesize 0 26214400 0
Cache hits are at 99%.
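In case the underlying numbers are useful to anyone, the BDB cache statistics (including the page hit rate) can be read with db_stat against the database directory, e.g.:

# report the BDB memory pool (cache) statistics for the environment
db_stat -m -h /var/openldap/openldap-data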
I'm stumped for a cause/solution. Can anyone either give me a pointer as to what to look for in the logs, or suggest a possible cause? Could it be hitting the 256 open file descriptor limit?
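In case it helps anyone judge whether that's plausible, this is roughly how I plan to check the descriptor usage the next time it wedges (plimit and pfiles are the stock Solaris tools; the pgrep just finds the slapd PID):

# show the per-process resource limits, including nofiles(descriptors)
plimit `pgrep -x slapd`
# list the file descriptors slapd currently has open
pfiles `pgrep -x slapd`

If the number of open descriptors at that point is sitting near the nofiles soft limit, I'd guess that points at the 256 limit.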
Thanks