We are using OpenLDAP 2.4.26 with BDB 4.8 and have replication set up in mirror mode for our main LDAP database. There are a couple of other replicas that hold a subset of the data the main cluster has, but we are seeing the following behavior on all of them.
When performing mass updates via LDAP, say on the order of 30,000 modifications to existing entries, we've noticed that the CPU use of the slapd instances goes through the roof (between 65% and 95% continuously) and stays there until slapd is restarted.
The problem is that this system has to be highly available, even for writing, and when these updates "shock" the system, response times suffer badly while the processes are churning like that. I don't think they are trying to catch up on the data changes, because if I let them run for a while after the updates are done (an hour, say) and then restart the instances, they go back to their normal state.
So far the only way I've been able to mitigate the issue is to repoint our LDAP proxy instances at a machine that is having less trouble, restart the instances that are struggling, then repoint the proxies back to the one just restarted, and restart the others. Not exactly a quick operation.
I've played with cache settings for both OpenLDAP and BDB and have reduced the frequency of this issue, but I can't seem to get rid of it completely, and it shows up quite often after large data manipulations. I'm at a loss for how to debug this since nothing is crashing. Any suggestions on how to find out what's causing it would be very helpful. The logs are not throwing any warnings or posting messages that seem out of the ordinary, and I have played with the log settings, but nothing seems to relate to anything that might explain why CPU usage goes so high.
Thanks in advance
Hi,
So, in my decreasing order of preference (due to decreasing accuracy and ease of setup): if your kernel is recent enough and supports perf events, you can use perf to see accurately where the CPU time is being spent. If it does not, you could try oprofile, even though that's more complex to set up than perf. When you don't have those tools ready to use, you can fall back on the poor man's profiler, i.e. sampling slapd's backtrace in a loop using pstack. If you do not have pstack, you can achieve the same in a more heavyweight manner with gdb. If you do not have gdb, you may try ltrace/strace. If you don't have those either, you should ask a Linux sysadmin for help ;-)
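The poor man's profiling loop just described could be sketched like this (a hypothetical wrapper, not from this thread; it defaults to sampling the current shell for demonstration, and you would point PID at slapd instead):

```shell
#!/bin/sh
# Poor man's profiler: sample a process's stacks in a loop and save them.
# PID defaults to this shell for demo; point it at slapd instead, e.g.
#   PID=$(pgrep slapd) sh poor-man-profile.sh
PID=${PID:-$$}
SAMPLES=${SAMPLES:-3}
INTERVAL=${INTERVAL:-1}
i=0
while [ "$i" -lt "$SAMPLES" ]; do
    echo "=== sample $i $(date) ==="
    # Prefer pstack; fall back to gdb; note when neither is available.
    pstack "$PID" 2>/dev/null \
        || gdb -batch -p "$PID" -ex 'thread apply all bt' 2>/dev/null \
        || echo "(pstack/gdb unavailable)"
    sleep "$INTERVAL"
    i=$((i + 1))
done > stack-samples.txt
grep -c '=== sample' stack-samples.txt
```

Frames that recur across most of the saved samples are where the CPU time is going.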
++Cyrille
From: openldap-technical-bounces@OpenLDAP.org On Behalf Of Jeffrey Crawford
Sent: Friday, March 16, 2012 7:28 AM
To: OpenLDAP technical list
Subject: OpenLDAP high CPU usage when performing mass changes
--
I fly because it releases my mind from the tyranny of petty things . . .
- Antoine de Saint-Exupéry

Jeffrey E. Crawford
ITS Application Administrator (IDM)
831-459-4365
jeffreyc@ucsc.edu
Maucci, Cyrille wrote:
Hi, So in my decreasing order of preference due to decreasing accuracy/easiness to setup. If your kernel is recent enough and support perf-events you could try to use perf to accurately know where the CPU is spent. If it does not, you could try oprofile even though that's more complex to setup than perf.
oprofile is good, and doesn't require the most recent kernels. But when you're seeing close to 100% CPU usage, it doesn't take a fine-grained profiler to see what's happening. A gdb stack trace will usually reveal the culprit immediately. Of course, it's more readable if you're running a non-optimized binary with debug symbols intact.
When you don't have those tools ready to be used, you can use the poor man profiling tool, i.e. sample that backtrace of slapd in loop using pstack. If you do not have pstack, you can achieve the same in a more heavy weight manner with gdb. If you do not have gdb, you may try ltrace/strace.
ltrace/strace are generally useless for debugging slapd issues.
- ltrace only traces calls into installed libraries. There are only two classes of CPU-hog bugs encountered with slapd: a) a stupid programmer error in OpenLDAP code which causes a tight infinite loop in OpenLDAP code, and thus never hits any library functions; b) a stupid programmer error in a library which causes a tight infinite loop in the library, and thus will only show up as a single library call. In both cases, a gdb stack trace will be more informative.
(b) is the most common case, and these days it's almost always glibc malloc at fault.
- strace only traces system calls. slapd performs system calls for only a few purposes, almost all of which are to perform I/O. I/O calls will never result in 100% CPU usage.
Jeffrey Crawford wrote:
We are using openldap 2.4.26 with BDB 4.8 and have replication set up in mirror mode for our main ldap database. There are a couple of other replicas that have a subset of the data that the main cluster has but we are seeing the following behavior on all of them.
When performing mass updates via LDAP, lets say on the order of 30,000 entries being added to existing entries. We've noticed that the CPU use of the slapd instances goes through the roof (between 65% and 95% continuously), and seems to stay there until it is restarted.
When the CPU usage goes high like that it should be pretty easy to see where it's going, by getting a gdb stack trace of the running process.
At a guess, based on the minimal amount of information here, you've run into the glibc malloc fragmentation issue, and switching to tcmalloc might avoid the problem.
The Problem is that this system has to be highly available, even for writing and when these updates "shock" the system, the response time goes way down when the process are turning like that. I don't think they are trying to catch up to the data changes because if I let them run a while after the updates are done. (Talking like 1hr) and then restart the instances, they go back to their normal state.
If you have the SYNC loglevel enabled, it should be obvious whether update traffic is the cause or not.
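If it is not already enabled, a minimal slapd.conf sketch for that might be (adding stats alongside sync is an optional extra, not something stated in this thread):

```
# slapd.conf: log syncrepl activity, plus connection/operation stats
loglevel sync stats
```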
So far the only way I've been able to mitigate the issues is to reconfigure our ldap proxy instances to a machine that is having less trouble, restart the instances that are chugging along, then repoint the proxies back to the one just started, and start the others. Not exactly a quick operation.
I've played with cache settings for both OpenLDAP and BDB and have gotten the frequency of this issue reduced but I can't seem to get rid of it completely and it shows up quite often after large data manipulations. I'm at a loss of how to debug since nothing is crashing. Any suggestions on how to find out what's causing this would be very helpful. The logs are not throwing any warnings or posting messages that would seem out of the ordinary and I have played with the log settings but nothing seems to relate to anything that might explain why we are seeing CPU usage to go so high.
I would suggest you try out back-mdb in RE24. MDB uses 1/4 the total memory of BDB and it performs far fewer mallocs, so glibc malloc fragmentation should not be a problem. (I would have suggested 2.4.30, but the ITS#7190 fix is rather important if you have large volumes of delete operations. The other MDB-related ITSs, #7191 and #7196, are only crucial for non-X86 and non-Linux platforms.)
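For anyone trying that, a minimal back-mdb stanza for slapd.conf might look like the following (the suffix, directory, and maxsize are placeholder values, not from this thread):

```
# Minimal back-mdb database definition (placeholder values)
database  mdb
suffix    "dc=example,dc=com"
directory /var/db/openldap-data
# maxsize caps the memory-map (and thus database) size, in bytes
maxsize   10737418240
```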
Hi Howard,
At a guess, based on the minimal amount of information here, you've run into the glibc malloc fragmentation issue, and switching to tcmalloc might avoid the problem.
What's the quickest way to validate this on the slapd that's running at 99%, prior to falling back on tcmalloc? Can the proc's smaps reveal this? Like if we're seeing loads of ~64MB regions?
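Along those lines, a quick smaps check for ~64MB regions might look like this (a hypothetical sketch; Linux-specific, since it reads /proc/<pid>/smaps, and it defaults to inspecting the current shell for demonstration):

```shell
#!/bin/sh
# Count mappings of roughly 64MB (the glibc malloc arena size on 64-bit)
# in a process's smaps. PID defaults to this shell for demo; point it at
# the busy slapd instead.
PID=${PID:-$$}
count=$(awk '/^Size:/ && $2 >= 65536 && $2 <= 66560 {c++} END {print c+0}' \
    "/proc/$PID/smaps")
echo "$count mappings of ~64MB in /proc/$PID/smaps"
```

Lots of such regions would be consistent with arena growth, though that is only circumstantial evidence.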
Thanks ++Cyrille
-----Original Message-----
From: openldap-technical-bounces@OpenLDAP.org On Behalf Of Howard Chu
Sent: Friday, March 16, 2012 8:32 AM
To: Jeffrey Crawford
Cc: OpenLDAP technical list
Subject: Re: OpenLDAP high CPU usage when performing mass changes
Maucci, Cyrille wrote:
Hi Howard,
At a guess, based on the minimal amount of information here, you've run into the glibc malloc fragmentation issue, and switching to tcmalloc might avoid the problem.
What's the quickest way to validate this on the running-at-99%-slapd, prior to falling back on tcmalloc?
Get the gdb stack trace.
Can the proc's smaps reveal this? Like if we're seeing loads many 64MB regions?
Get the gdb stack trace.
Don't guess.
Get the gdb stack trace.
Don't bother with other ineffective diagnostic tools.
Get the gdb stack trace.
Don't google for the symptoms.
Get the gdb stack trace.
Whatever is going on, if the CPU is near 100%, then most likely whatever non-idle thread you see in the stack trace is going to show you the location of the problem.
malloc is normally a fast operation. The chance of you catching slapd inside malloc on any random stack trace is usually near zero, when all's well. If you catch slapd inside glibc malloc during one of these 100% CPU instances, then that's a fair indication. If you resume and then get another trace a few seconds later, and the trace looks the same, then that's pretty conclusive.
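The resume-and-retrace check described above could be scripted roughly like this (a sketch, not from this thread; it assumes gdb can attach to the target, and defaults to tracing the current shell for demonstration):

```shell
#!/bin/sh
# Grab two full-thread backtraces a few seconds apart. If both show
# threads inside glibc malloc/free during a 100%-CPU episode, malloc
# fragmentation is the likely culprit. PID defaults to this shell for
# demo; point it at the busy slapd instead.
PID=${PID:-$$}
for n in 1 2; do
    gdb -batch -p "$PID" -ex 'thread apply all bt' > "trace$n.txt" 2>&1
    [ "$n" = 1 ] && sleep 5
done
# List which snapshots caught malloc frames, if any.
grep -l 'malloc' trace1.txt trace2.txt || echo "no malloc frames caught"
```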
I'm sure it would have been immensely helpful if I had mentioned that this is running on FreeBSD, just in case that changes the answers.
On Fri, Mar 16, 2012 at 12:32 AM, Howard Chu hyc@symas.com wrote:
Jeffrey Crawford wrote:
We are using openldap 2.4.26 with BDB 4.8 and have replication set up in mirror mode for our main ldap database. There are a couple of other replicas that have a subset of the data that the main cluster has but we are seeing the following behavior on all of them.
When performing mass updates via LDAP, lets say on the order of 30,000 entries being added to existing entries. We've noticed that the CPU use of the slapd instances goes through the roof (between 65% and 95% continuously), and seems to stay there until it is restarted.
When the CPU usage goes high like that it should be pretty easy to see where it's going, by getting a gdb stack trace of the running process.
At a guess, based on the minimal amount of information here, you've run into the glibc malloc fragmentation issue, and switching to tcmalloc might avoid the problem.
Fair enough. It doesn't look like gperftools is available as a FreeBSD package, so we'll have to do a manual install, but this looks like the easiest and most logical first step to just "see if this fixes things".
Just for reference, we would want to set LD_PRELOAD=/path/to/libtcmalloc_minimal.so rather than compile OpenLDAP against it, right?
The Problem is that this system has to be highly available, even for writing and when these updates "shock" the system, the response time goes way down when the process are turning like that. I don't think they are trying to catch up to the data changes because if I let them run a while after the updates are done. (Talking like 1hr) and then restart the instances, they go back to their normal state.
If you have the SYNC loglevel enabled, it should be obvious whether update traffic is the cause or not.
I can try that, but I don't think it is.
So far the only way I've been able to mitigate the issues is to reconfigure our ldap proxy instances to a machine that is having less trouble, restart the instances that are chugging along, then repoint the proxies back to the one just started, and start the others. Not exactly a quick operation.
I've played with cache settings for both OpenLDAP and BDB and have gotten the frequency of this issue reduced but I can't seem to get rid of it completely and it shows up quite often after large data manipulations. I'm at a loss of how to debug since nothing is crashing. Any suggestions on how to find out what's causing this would be very helpful. The logs are not throwing any warnings or posting messages that would seem out of the ordinary and I have played with the log settings but nothing seems to relate to anything that might explain why we are seeing CPU usage to go so high.
I would suggest you try out back-mdb in RE24. MDB uses 1/4 the total memory of BDB and it performs far fewer mallocs, so glibc malloc fragmentation should not be a problem. (I would have suggested 2.4.30, but the ITS#7190 fix is rather important if you have large volumes of delete operations. The other MDB-related ITSs, #7191 and #7196, are only crucial for non-X86 and non-Linux platforms.)
Ugh, this would get ugly. Our institution frowns upon "special installs", and back-mdb isn't part of the vendor install (yes, FreeBSD is considered a vendor). Not saying it can't be done, but it feels like the Spanish Inquisition when having to make a request like this. Of course, none of you care about that ;)
--
Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Jeffrey Crawford wrote:
I'm sure this would have been immensely helpful If I had mentioned that this is running on FreeBSD, just in case that changes answers
Indeed. I don't recall, does FreeBSD use glibc as their system C library? If not, then tcmalloc may not be relevant.
On Fri, Mar 16, 2012 at 12:32 AM, Howard Chu <hyc@symas.com> wrote:
Fair enough doesn't look like gperftools are a package in FreeBSD so we'll have to do a manual install but it looks like this would be the easiest and logical first step to just "see if this fixes things"
The *first step* is to get the gdb backtrace and see WTH is going on.
Just for reference we would want to set LD_PRELOAD=/path/to/libtcmalloc_minimal.so and not try to compile OpenLDAP against it right?
If it appears that tcmalloc is called for, then yes, this is the way to go.
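As a launch-configuration sketch under that assumption (both paths are hypothetical and must be adjusted for your install; FreeBSD's runtime linker honors LD_PRELOAD as well):

```shell
# Preload tcmalloc so slapd's allocations bypass the system allocator;
# no recompile of OpenLDAP needed. Paths are placeholders.
LD_PRELOAD=/usr/local/lib/libtcmalloc_minimal.so \
    /usr/local/libexec/slapd -f /usr/local/etc/openldap/slapd.conf
```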