2.4.16 (with BDB4.7 + patches and the POSIX threads build options) has been stable on amd64/x86_64 for us, but I'm encountering problems with 2.4.16 on i386.
slapd will run happily for several hours at an RSS of ~1.2GB, but then will suddenly gobble memory until it hits the i386 address space limit of 3GB and dies with random segfaults as memory allocations fail. BDB caches are set to ~1.25GB total and the entry cache to 10,000 entries. I'm not sure how long it takes slapd to consume the extra 2GB of memory, but it's definitely less than 20 minutes according to sar(1).
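(For concreteness, here is a sketch of the kind of configuration in play; the values are illustrative, not our exact settings:)

# slapd.conf (back-bdb): entry cache size, in entries
cachesize 10000

# DB_CONFIG in the database directory: 1GB + 256MB of BDB cache in one segment
set_cachesize 1 268435456 1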
I have a couple of core files from when this happens. Are there structures that are likely suspects for unchecked growth (the entry and BDB caches, for example) that I can poke at with a debugger to get an idea of what's going on?
john
John Morrissey wrote:
> 2.4.16 (with BDB4.7 + patches and the POSIX threads build options) has been stable on amd64/x86_64 for us, but I'm encountering problems with 2.4.16 on i386.
On what OS?
I've been seeing similar issues on Linux, but heap profilers / leak checkers don't turn up any results. At least in the Linux case, it appears to be general heap fragmentation caused by continual mmap/munmap of malloc'd memory blocks.
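If anyone wants to test that theory, glibc's malloc can be told to stop serving large blocks via mmap at all, which takes the map/unmap churn out of the picture. A sketch, assuming glibc's standard malloc environment tunables; the path and options are illustrative:

# MALLOC_MMAP_MAX_=0 disables mmap for allocations entirely; a softer
# experiment is to pin MALLOC_MMAP_THRESHOLD_ to a large fixed value.
MALLOC_MMAP_MAX_=0 /usr/sbin/slapd -u openldap -g openldap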
> slapd will run happily for several hours at an RSS of ~1.2GB, but then will suddenly gobble memory until it hits the i386 address space limit of 3GB and dies with random segfaults as memory allocations fail. BDB caches are set to ~1.25GB total and the entry cache to 10,000 entries. I'm not sure how long it takes slapd to consume the extra 2GB of memory, but it's definitely less than 20 minutes according to sar(1).
> I have a couple of core files from when this happens. Are there structures that are likely suspects for unchecked growth (the entry and BDB caches, for example) that I can poke at with a debugger to get an idea of what's going on?
> john
--On Monday, April 13, 2009 4:28 PM -0700 Howard Chu <hyc@symas.com> wrote:
> John Morrissey wrote:
>> 2.4.16 (with BDB4.7 + patches and the POSIX threads build options) has been stable on amd64/x86_64 for us, but I'm encountering problems with 2.4.16 on i386.
> On what OS?
> I've been seeing similar issues on Linux, but heap profilers / leak checkers don't turn up any results. At least in the Linux case, it appears to be general heap fragmentation caused by continual mmap/munmap of malloc'd memory blocks.
Does using something like tcmalloc instead of glibc help with this?
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
On Mon, Apr 13, 2009 at 04:28:58PM -0700, Howard Chu wrote:
> John Morrissey wrote:
>> 2.4.16 (with BDB4.7 + patches and the POSIX threads build options) has been stable on amd64/x86_64 for us, but I'm encountering problems with 2.4.16 on i386.
> On what OS?
Linux.
> I've been seeing similar issues on Linux, but heap profilers / leak checkers don't turn up any results. At least in the Linux case, it appears to be general heap fragmentation caused by continual mmap/munmap of malloc'd memory blocks.
ugh. Finally caught this machine in the act. By the time I got to it, it had ~200 file descriptors left open (each of our consumers normally has 4k+), most of which were:
slapd 32667 openldap 5466u sock 0,4 113001170 can't identify protocol
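(An easy way to watch the descriptor count drain, for anyone chasing the same thing; the pidof lookup is illustrative:)

# print the number of open fds for the running slapd every 30 seconds
while sleep 30; do ls /proc/"$(pidof slapd)"/fd | wc -l; done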
slapd accepts connections but returns no results for any operations. For example, strace(1) shows ldapsearch(1) connecting successfully, sending a bind request, and then waiting forever on the result.
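(A quick probe for this wedged state, assuming coreutils timeout(1); the host and deadline are illustrative:)

# exit status 124 means the rootDSE search hung past the 5-second deadline
timeout 5 ldapsearch -x -H ldap://127.0.0.1 -s base -b '' '(objectClass=*)' \
    > /dev/null || echo "slapd not answering"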
Each worker thread was consuming a few percent of CPU and was in a tight loop failing to read from connections:
futex(0xb7c5dc4c, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0xb7c5dc4c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
send(4, "<167>Apr 14 16:59:34 slapd[32667]"..., 74, MSG_NOSIGNAL) = 74
 | 00000  3c 31 36 37 3e 41 70 72  20 31 34 20 31 36 3a 35  <167>Apr 14 16:5 |
 | 00010  39 3a 33 34 20 73 6c 61  70 64 5b 33 32 36 36 37  9:34 sla pd[32667 |
 | 00020  5d 3a 20 63 6f 6e 6e 65  63 74 69 6f 6e 5f 72 65  ]: conne ction_re |
 | 00030  61 64 28 34 35 35 30 29  3a 20 6e 6f 20 63 6f 6e  ad(4550) : no con |
 | 00040  6e 65 63 74 69 6f 6e 21  0a 00                    nection! ..       |
futex(0xb7c5dc4c, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x895f1b4, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x895f1b4, FUTEX_WAKE_PRIVATE, 1) = 0
time(NULL) = 1239728375
futex(0xb7c5d80c, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
futex(0xb7c5d80c, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xb7c5dc4c, FUTEX_WAIT_PRIVATE, 2, NULL
For reference, slapd has been emitting lots of these errors, but their frequency increases by more than an order of magnitude when slapd begins to wedge:
# sed -n '/no connection/s/^Apr 14 \(..\).*/\1/p' slapd.20090414 | sort | uniq -c
[...]
  46916 10
  48511 11
  50828 12
  53492 13
 229907 14
1040360 15
 808565 16
At this point, slapd's VSIZE had increased to ~2.6GB, which was consistent with /proc/PID/smaps. Most of its mappings were 1Mbyte or 2Mbytes, with a 1GB region for its SHM cache:
# awk '/Size/ {print $2}' smaps | sort | uniq -c | sort -rn | head
    307 2048
    149 1024
     67 4
     18 8
     13 1020
[...]
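A companion one-liner totals the mapped size from the same snapshot (a sketch; assumes the smaps copy saved above):

# sum every mapping's Size: field (kB) and report the total in GB
awk '/^Size:/ {total += $2} END {printf "%.1f GB\n", total / 1048576}' smaps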
I saved a core and backtrace of this slapd image. I blocked off all LDAP connections to this host, and slapd's memory consumption seems stable for the moment, but it is still not responding to any LDAP operations. Is there anything else I can poke at to help troubleshoot this? Howard, do you still think this is due to churn-related heap fragmentation?
john
On Tue, Apr 14, 2009 at 02:55:29PM -0400, John Morrissey wrote:
> On Mon, Apr 13, 2009 at 04:28:58PM -0700, Howard Chu wrote:
>> John Morrissey wrote:
>>> 2.4.16 (with BDB4.7 + patches and the POSIX threads build options) has been stable on amd64/x86_64 for us, but I'm encountering problems with 2.4.16 on i386.
>> On what OS?
> Linux.
>> I've been seeing similar issues on Linux, but heap profilers / leak checkers don't turn up any results. At least in the Linux case, it appears to be general heap fragmentation caused by continual mmap/munmap of malloc'd memory blocks.
> ugh. Finally caught this machine in the act. By the time I got to it, it had ~200 file descriptors left open (each of our consumers normally has 4k+), most of which were:
> slapd 32667 openldap 5466u sock 0,4 113001170 can't identify protocol
> slapd accepts connections but returns no results for any operations. For example, strace(1) shows ldapsearch(1) connecting successfully, sending a bind request, and then waiting forever on the result.
> Each worker thread was consuming a few percent of CPU and was in a tight loop failing to read from connections:
> futex(0xb7c5dc4c, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
> futex(0xb7c5dc4c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> send(4, "<167>Apr 14 16:59:34 slapd[32667]"..., 74, MSG_NOSIGNAL) = 74
>  | 00000  3c 31 36 37 3e 41 70 72  20 31 34 20 31 36 3a 35  <167>Apr 14 16:5 |
>  | 00010  39 3a 33 34 20 73 6c 61  70 64 5b 33 32 36 36 37  9:34 sla pd[32667 |
>  | 00020  5d 3a 20 63 6f 6e 6e 65  63 74 69 6f 6e 5f 72 65  ]: conne ction_re |
>  | 00030  61 64 28 34 35 35 30 29  3a 20 6e 6f 20 63 6f 6e  ad(4550) : no con |
>  | 00040  6e 65 63 74 69 6f 6e 21  0a 00                    nection! ..       |
> futex(0xb7c5dc4c, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x895f1b4, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
> futex(0x895f1b4, FUTEX_WAKE_PRIVATE, 1) = 0
> time(NULL) = 1239728375
> futex(0xb7c5d80c, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0xb7c5d80c, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
> futex(0xb7c5d80c, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0xb7c5dc4c, FUTEX_WAIT_PRIVATE, 2, NULL
> For reference, slapd has been emitting lots of these errors, but their frequency increases by more than an order of magnitude when slapd begins to wedge:
> # sed -n '/no connection/s/^Apr 14 \(..\).*/\1/p' slapd.20090414 | sort | uniq -c
> [...]
>   46916 10
>   48511 11
>   50828 12
>   53492 13
>  229907 14
> 1040360 15
>  808565 16
> At this point, slapd's VSIZE had increased to ~2.6GB, which was consistent with /proc/PID/smaps. Most of its mappings were 1Mbyte or 2Mbytes, with a 1GB region for its SHM cache:
> # awk '/Size/ {print $2}' smaps | sort | uniq -c | sort -rn | head
>     307 2048
>     149 1024
>      67 4
>      18 8
>      13 1020
> [...]
> I saved a core and backtrace of this slapd image. I blocked off all LDAP connections to this host, and slapd's memory consumption seems stable for the moment, but it is still not responding to any LDAP operations. Is there anything else I can poke at to help troubleshoot this? Howard, do you still think this is due to churn-related heap fragmentation?
Any response to this, Howard? slapd finally consumed enough memory on this machine that the kernel OOM killer terminated it, but this problem is trivial for us to reproduce (happens after a few days of slapd uptime).
john
John Morrissey wrote:
> Any response to this, Howard? slapd finally consumed enough memory on this machine that the kernel OOM killer terminated it, but this problem is trivial for us to reproduce (happens after a few days of slapd uptime).
If it's so easily reproducible, you should be able to recreate this while running a heap profiler. Can you get hold of Google's tcmalloc and run with that? If it's not already on your platform, you can get it here:
http://code.google.com/p/google-perftools/
You don't need to recompile slapd to use it. Just set LD_PRELOAD=/path/to/tcmalloc before running slapd. (But you still need debug symbols.) See the docs here:
http://google-perftools.googlecode.com/svn/trunk/doc/heapprofile.html
You should also set HEAPPROFILE to the path where the profile log should be dumped.
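Roughly like this; the paths are illustrative, so adjust them for wherever the library and profile dumps live on your system:

# run slapd with tcmalloc preloaded and heap profiling enabled
LD_PRELOAD=/usr/lib/libtcmalloc.so \
HEAPPROFILE=/var/tmp/slapd.hprof \
/usr/sbin/slapd -u openldap -g openldap

# afterwards, analyze the dumped profiles with pprof from the same package
pprof --text /usr/sbin/slapd /var/tmp/slapd.hprof.0001.heap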
On Tue, Apr 21, 2009 at 10:33:23AM -0700, Howard Chu wrote:
> John Morrissey wrote:
>> Any response to this, Howard? slapd finally consumed enough memory on this machine that the kernel OOM killer terminated it, but this problem is trivial for us to reproduce (happens after a few days of slapd uptime).
> If it's so easily reproducible, you should be able to recreate this while running a heap profiler. Can you get hold of Google's tcmalloc and run with that? If it's not already on your platform, you can get it here:
Giving that a shot now; we should have something more in a few days. Thanks, Howard.
john
On Tue, Apr 21, 2009 at 10:33:23AM -0700, Howard Chu wrote:
> John Morrissey wrote:
>> Any response to this, Howard? slapd finally consumed enough memory on this machine that the kernel OOM killer terminated it, but this problem is trivial for us to reproduce (happens after a few days of slapd uptime).
> If it's so easily reproducible, you should be able to recreate this while running a heap profiler. Can you get hold of Google's tcmalloc and run with that?
Ran 2.4.16+tcmalloc for about a week with heap profiling enabled. Memory size was stable. slapd caught a SIGABRT due to an assertion failure in connection_close() after about 7 days of uptime. Backtrace is log.assertion-failure.
Restarted slapd; it ran for two days before no longer responding to incoming connections (the connection would be accepted, but all LDAP operations would block). All worker threads seem to be blocking on mutex acquisition in bdb_cache_lru_link(). One thread was chewing lots of CPU. Backtrace is log.bdb-cache-locks.
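(For anyone who wants to capture the same thing: all-thread backtraces can be pulled from the wedged process without killing it. A sketch; the pidof lookup and output file are illustrative:)

# attach gdb, dump a backtrace of every thread, then detach
gdb -p "$(pidof slapd)" -batch -ex 'thread apply all bt' > slapd-threads.txt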
john
John Morrissey wrote:
> On Tue, Apr 21, 2009 at 10:33:23AM -0700, Howard Chu wrote:
>> John Morrissey wrote:
>>> Any response to this, Howard? slapd finally consumed enough memory on this machine that the kernel OOM killer terminated it, but this problem is trivial for us to reproduce (happens after a few days of slapd uptime).
>> If it's so easily reproducible, you should be able to recreate this while running a heap profiler. Can you get hold of Google's tcmalloc and run with that?
> Ran 2.4.16+tcmalloc for about a week with heap profiling enabled. Memory size was stable.
OK, then your issue is with the default libc malloc working poorly wrt heap fragmentation; you should probably just use tcmalloc from now on.
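To make that stick, preload it wherever slapd gets started. A sketch for a Debian-style system, where the init script sources /etc/default/slapd; the path is illustrative:

# in /etc/default/slapd:
export LD_PRELOAD=/usr/lib/libtcmalloc.so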
> slapd caught a SIGABRT due to an assertion failure in connection_close() after about 7 days of uptime. Backtrace is log.assertion-failure.
Separate issue, probably should submit to ITS.
> Restarted slapd; it ran for two days before no longer responding to incoming connections (the connection would be accepted, but all LDAP operations would block). All worker threads seem to be blocking on mutex acquisition in bdb_cache_lru_link(). One thread was chewing lots of CPU. Backtrace is log.bdb-cache-locks.
Also a separate issue.
On Fri, May 01, 2009 at 02:38:52PM -0700, Howard Chu wrote:
> John Morrissey wrote:
>> On Tue, Apr 21, 2009 at 10:33:23AM -0700, Howard Chu wrote:
>>> John Morrissey wrote:
>>>> Any response to this, Howard? slapd finally consumed enough memory on this machine that the kernel OOM killer terminated it, but this problem is trivial for us to reproduce (happens after a few days of slapd uptime).
>>> If it's so easily reproducible, you should be able to recreate this while running a heap profiler. Can you get hold of Google's tcmalloc and run with that?
>> Ran 2.4.16+tcmalloc for about a week with heap profiling enabled. Memory size was stable.
> OK, then your issue is with the default libc malloc working poorly wrt heap fragmentation; you should probably just use tcmalloc from now on.
nod.
>> slapd caught a SIGABRT due to an assertion failure in connection_close() after about 7 days of uptime. Backtrace is log.assertion-failure.
> Separate issue, probably should submit to ITS.
http://www.openldap.org/its/index.cgi/?findid=6089
>> Restarted slapd; it ran for two days before no longer responding to incoming connections (the connection would be accepted, but all LDAP operations would block). All worker threads seem to be blocking on mutex acquisition in bdb_cache_lru_link(). One thread was chewing lots of CPU. Backtrace is log.bdb-cache-locks.
> Also a separate issue.
http://www.openldap.org/its/index.cgi/?findid=6090
john