A few weeks ago, we upgraded to OpenLDAP 2.4.19. All seemed well for about 4 weeks, but within the last few days (starting Jan 14) slapd has been dying on our replica servers. The crashes don't seem to follow any pattern, and the system looks fine when slapd dies: it isn't out of memory and doesn't show any load spikes. In our logs we get messages like:
From ldap.log:

Jan 14 22:18:34 slapd[7940]: ch_malloc of 14698087920909635104 bytes failed

From /var/log/messages:

Jan 15 16:58:03 kernel: slapd[22663] general protection rip:2ac667af65cc rsp:4659d970 error:0
Jan 16 15:42:34 kernel: slapd[10272] general protection rip:2b71f6c6d9c4 rsp:45441980 error:0
Jan 16 17:20:01 kernel: slapd[7538] general protection rip:2aec51954ae0 rsp:449518d0 error:0
Jan 19 13:38:46 kernel: slapd[2821] general protection rip:2aeac3070ae0 rsp:4918f8d0 error:0

From another replica server:

Jan 16 20:56:15 kernel: slapd[17948] general protection rip:2aedf150d9c4 rsp:4154f980 error:0
Jan 18 01:42:46 kernel: slapd[9446] general protection rip:2ae369401ae0 rsp:454c08d0 error:0
Jan 19 13:04:29 kernel: slapd[25339] general protection rip:2b9803877ae0 rsp:4b6bc8d0 error:0
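One thing the kernel lines might already tell us is whether the crashes share a fault site. The following is only a rough sketch. It assumes the faulting code lives in a shared library mapped at a page-aligned base address, so that the low 12 bits of rip stay the same across restarts even when the base address moves. The names LOG, RIP_RE, and page_offsets are made up for illustration; the log lines are the ones quoted above.

```python
import re
from collections import Counter

# Kernel log lines copied from the two replica servers quoted above.
LOG = """\
Jan 15 16:58:03 kernel: slapd[22663] general protection rip:2ac667af65cc rsp:4659d970 error:0
Jan 16 15:42:34 kernel: slapd[10272] general protection rip:2b71f6c6d9c4 rsp:45441980 error:0
Jan 16 17:20:01 kernel: slapd[7538] general protection rip:2aec51954ae0 rsp:449518d0 error:0
Jan 19 13:38:46 kernel: slapd[2821] general protection rip:2aeac3070ae0 rsp:4918f8d0 error:0
Jan 16 20:56:15 kernel: slapd[17948] general protection rip:2aedf150d9c4 rsp:4154f980 error:0
Jan 18 01:42:46 kernel: slapd[9446] general protection rip:2ae369401ae0 rsp:454c08d0 error:0
Jan 19 13:04:29 kernel: slapd[25339] general protection rip:2b9803877ae0 rsp:4b6bc8d0 error:0
"""

# Matches the hexadecimal instruction pointer in a general protection line.
RIP_RE = re.compile(r"general protection rip:([0-9a-f]+)")


def page_offsets(log_text):
    """Count crashes by the low 12 bits of rip (offset within a 4 KiB page).

    Assumption: the library containing the faulting instruction is mapped
    at a page-aligned address, so this offset is stable across restarts.
    """
    counts = Counter()
    for match in RIP_RE.finditer(log_text):
        rip = int(match.group(1), 16)
        counts[rip & 0xFFF] += 1
    return counts


offsets = page_offsets(LOG)
for offset, count in offsets.most_common():
    print(f"page offset 0x{offset:03x}: {count} crash(es)")
```

If a few offsets account for most of the crashes, that would hint at a small number of fault sites rather than random corruption, though a core dump and backtrace would still be needed to identify the actual function.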
We're running on RHEL 5.4, with Heimdal 1.2.1-3, OpenSSL 0.9.8k, Cyrus-SASL 2.1.23, BDB 4.7.25 (with patches), libunwind 0.99 (for Google tcmalloc), Google tcmalloc 1.3.
1. Is there any useful information that can be obtained from these log entries, or do we simply need to change to a more verbose log level and wait for slapd to die again?

2. If we need to change our log level, what is a suggested level? Right now we're using "loglevel sync stats". Would it be wise to change the log level to -1 (any)? These are production servers, and I imagine that'd be a huge performance hit.

3. Also, we're logging asynchronously at the moment. Should we disable this while debugging?
Thanks!