A few weeks ago, we upgraded to OpenLDAP 2.4.19. All seemed well for
about 4 weeks, but over the last few days (starting Jan 14) slapd has
been dying on our replica servers. The crashes don't seem to follow any
pattern, and the system looks healthy when slapd dies: it isn't out of
memory and shows no load spikes. In our logs we get messages like:
Jan 14 22:18:34 slapd[7940]: ch_malloc of 14698087920909635104 bytes failed
(from ldap.log)
Jan 15 16:58:03 kernel: slapd[22663] general protection rip:2ac667af65cc rsp:4659d970 error:0
Jan 16 15:42:34 kernel: slapd[10272] general protection rip:2b71f6c6d9c4 rsp:45441980 error:0
Jan 16 17:20:01 kernel: slapd[7538] general protection rip:2aec51954ae0 rsp:449518d0 error:0
Jan 19 13:38:46 kernel: slapd[2821] general protection rip:2aeac3070ae0 rsp:4918f8d0 error:0
(from /var/log/messages)
From another replica server:
Jan 16 20:56:15 kernel: slapd[17948] general protection rip:2aedf150d9c4 rsp:4154f980 error:0
Jan 18 01:42:46 kernel: slapd[9446] general protection rip:2ae369401ae0 rsp:454c08d0 error:0
Jan 19 13:04:29 kernel: slapd[25339] general protection rip:2b9803877ae0 rsp:4b6bc8d0 error:0
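
One thing that can be checked from the entries above without any extra
logging is whether the faulting instruction pointers repeat. The
following is only a rough sketch: it assumes 4 KiB pages and that code
is mapped page-aligned, so the low 12 bits of rip stay the same for a
given instruction even when the load address differs between runs and
hosts. The function name page_offsets and the LOG string are just for
illustration; LOG is the kernel lines above with the wrapped lines
rejoined.

```python
import re
from collections import Counter

# Kernel log entries copied from this post (both replicas), one per line.
LOG = """\
Jan 15 16:58:03 kernel: slapd[22663] general protection rip:2ac667af65cc rsp:4659d970 error:0
Jan 16 15:42:34 kernel: slapd[10272] general protection rip:2b71f6c6d9c4 rsp:45441980 error:0
Jan 16 17:20:01 kernel: slapd[7538] general protection rip:2aec51954ae0 rsp:449518d0 error:0
Jan 19 13:38:46 kernel: slapd[2821] general protection rip:2aeac3070ae0 rsp:4918f8d0 error:0
Jan 16 20:56:15 kernel: slapd[17948] general protection rip:2aedf150d9c4 rsp:4154f980 error:0
Jan 18 01:42:46 kernel: slapd[9446] general protection rip:2ae369401ae0 rsp:454c08d0 error:0
Jan 19 13:04:29 kernel: slapd[25339] general protection rip:2b9803877ae0 rsp:4b6bc8d0 error:0
"""

RIP_RE = re.compile(r"rip:([0-9a-f]+)")


def page_offsets(log_text, page_size=4096):
    """Count faulting rip values by their offset within a page.

    page_size=4096 is an assumption (the usual x86-64 Linux page size).
    """
    counts = Counter()
    for match in RIP_RE.finditer(log_text):
        rip = int(match.group(1), 16)
        counts[rip % page_size] += 1
    return counts


if __name__ == "__main__":
    for offset, n in page_offsets(LOG).most_common():
        print(f"offset 0x{offset:03x}: {n} crash(es)")
```

For these seven entries the offsets come out as 0xae0 four times, 0x9c4
twice and 0x5cc once, and both 0xae0 and 0x9c4 appear on both replicas.
That suggests the crashes hit only a few code locations, but it does not
say which library or function they are in; that would still need a core
dump or the process memory map.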
We're running on RHEL 5.4, with Heimdal 1.2.1-3, OpenSSL 0.9.8k,
Cyrus-SASL 2.1.23, BDB 4.7.25 (with patches), libunwind 0.99 (for Google
tcmalloc), Google tcmalloc 1.3.
1. Is there any useful information that can be obtained from these log
entries, or do we simply need to change to a more verbose log level and
wait for slapd to die again?
2. If we need to change our log level, what is a suggested level? Right
now we're using "loglevel sync stats". Would it be wise to change the
log level to -1 (any)? These are production servers, and I imagine
that'd be a huge performance hit.
3. Also, we're logging asynchronously at the moment. Should we disable
this while debugging?
Thanks!