I've got my two-node MMR setup running with delta-syncrepl. Node 1 is up with ~340k entries in the main DIT. It consumes about 4G of VM on a Red Hat AS6 box, and the data dir is around 2.3G on-disk (including __db.* and the one log.* file...), plus another 100M in cn=accesslog.
I'm bringing up node 2 after nuking the data dir and letting it syncrepl from nothing. I was chasing a SIGBUS error, but that turned out to be olcLogLevel=Stats+Sync generating a huge amount of logging and filling up the disk. Fixed that and moved on.
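For anyone chasing the same symptom, the fix was something along these lines (the exact log level you keep is a matter of taste):

ldapmodify -Y EXTERNAL -H ldapi:/// <<EOF
dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: stats
EOF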
Now it's still crashing, but this time it's because slapd is bloating up hugely and the machine runs out of VM and kills the process. I've added about 8G of temporary swap, and it's still going: slapd on node 2 is 5.5G resident and nearly 15G in total size now. On-disk the hdb is only around 1.6G, so it still has a long way to go.
Can I assume this is not the way it should be?
vmstat shows a lot of swap-out but almost no swap-in, so it's not thrashing. pmap shows a large number of 64M "anon" segments. 140 of those at one point, and 157 when I checked just now.
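For reference, I was counting those segments with something like the following (assumes a single slapd process and the usual procps pmap output, with the mapping size in the second column):

pmap $(pidof slapd) | awk '$2 == "65536K"' | wc -l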
It looks a lot like a memory leak, though I can't tell offhand whether the problem is in OpenLDAP (2.4.31) or in BerkeleyDB (5.3.15). When it finishes, I'm planning to turn off delta-syncrepl, pave and rebuild again, and see if it behaves the same. I could also give mdb a shot, but since this is a mirror I'd have to rebuild both sides.
Any suggestions as to where I could start looking for the source of the problem? Obviously I'm not planning on rebuilding this node on a regular basis (and certainly not all via syncrepl) but I'm concerned that over an extended period of time it'll leak memory even during normal use.
I'm including my DB_CONFIG, just in case. I can supply more of the config as necessary.
DB_CONFIG:
set_cachesize 0 536870912 0
set_lg_regionmax 10485760
set_lg_max 104857600
set_lg_bsize 26214400
set_lk_max_locks 4096
set_lk_max_objects 4096
set_flags DB_LOG_AUTOREMOVE
(If anything looks particularly stupid in here, even unrelated to the leak, I'd love the advice...)
--On Friday, May 18, 2012 3:15 PM -0300 Brandon Hume hume-ol@bofh.ca wrote:
hdb is only around 1.6G, so it still has a long way to go.
I'm including my DB_CONFIG, just in case. I can supply more of the config as necessary.
DB_CONFIG:
set_cachesize 0 536870912 0
Are you using shm keys with BDB?
I don't know anyone who is using BDB 5.3 at this point in time. I would back down to something less recent and see if you see the same issue. I've used BDB 4.7 for quite some time w/o issue.
Also, your BDB cachesize seems undersized, given the size of your DB. It should be no less than dn2id.bdb + id2entry.bdb + 10% for growth. Personally, I prefer to keep my BDB cachesize at <size of DB>, or in your case, I'd set it to 2 GB, i.e.:
set_cachesize 2 0 0
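A quick way to check those file sizes (assuming the usual /var/lib/ldap layout; substitute your own olcDbDirectory):

du -ch /var/lib/ldap/dn2id.bdb /var/lib/ldap/id2entry.bdb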
You certainly could try switching just this one side to mdb, and seeing how it does, and then switching the other side over if you like how it is performing compared to BDB. Both nodes are not required to run the same DB backend.
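A minimal back-mdb stanza in cn=config would look something like the following; the suffix, directory, and maxsize here are just placeholders, not a recommendation:

dn: olcDatabase=mdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcMdbConfig
olcDatabase: mdb
olcSuffix: dc=example,dc=com
olcDbDirectory: /var/lib/ldap-mdb
olcDbMaxSize: 10737418240
olcDbIndex: objectClass eq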
--Quanah
--
Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
On 05/18/12 03:36 PM, Quanah Gibson-Mount wrote:
Are you using shm keys with BDB?
Not at this time.
I don't know anyone who is using BDB 5.3 at this point in time. I would back down to something less recent and see if you see the same issue. I've used BDB 4.7 for quite some time w/o issue.
Well, with 4.8.30, the problem is still evident. Overall, using delta-syncrepl seems to make a large difference. With my previous test (still with bdb 5.3) and delta-syncrepl, the slapd process grew to 16G before dying. With delta off (removing the 'syncdata=accesslog logbase="cn=accesslog" logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"' portion of the olcSyncrepl line) the node started and finished loading the entire DB and grew to about 5.7G in total.
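For context, the consumer definition with delta enabled looks roughly like this; the provider, DNs, and credentials below are placeholders rather than my real config:

olcSyncrepl: rid=001 provider=ldap://node1.example.com bindmethod=simple
  binddn="cn=replicator,dc=example,dc=com" credentials=secret
  searchbase="dc=example,dc=com" type=refreshAndPersist retry="30 +"
  syncdata=accesslog logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"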
I don't have the bdb 4.7 tarball handy... but I'll be giving mdb a shot next and see if it behaves the same.
Also, your BDB cachesize seems undersized, given the size of your DB. It should be no less than dn2id.bdb + id2entry.bdb + 10% for growth. Personally, I prefer to keep my BDB cachesize at <size of DB>, or in your case, I'd set it to 2 GB, i.e.:
Thanks; if I stick with hdb I'll make that tweak. Those values are pretty old, and I've been shy about changing too much while changing everything else. :)
Brandon Hume wrote:
On 05/18/12 03:36 PM, Quanah Gibson-Mount wrote:
Are you using shm keys with BDB?
Not at this time.
I don't know anyone who is using BDB 5.3 at this point in time. I would back down to something less recent and see if you see the same issue. I've used BDB 4.7 for quite some time w/o issue.
Well, with 4.8.30, the problem is still evident. Overall, using delta-syncrepl seems to make a large difference. With my previous test (still with bdb 5.3) and delta-syncrepl, the slapd process grew to 16G before dying. With delta off (removing the 'syncdata=accesslog logbase="cn=accesslog" logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"' portion of the olcSyncrepl line) the node started and finished loading the entire DB and grew to about 5.7G in total.
The fact that switching off delta changes the behavior would point pretty squarely at a bug in the delta-sync consumer code, not anything in the underlying backend. If you can produce a small test case to demonstrate the problem, it would be a good idea to post this to the ITS.
----- Howard Chu hyc@symas.com wrote:
Brandon Hume wrote:
On 05/18/12 03:36 PM, Quanah Gibson-Mount wrote:
Are you using shm keys with BDB?
Not at this time.
I don't know anyone who is using BDB 5.3 at this point in time. I would back down to something less recent and see if you see the same issue. I've used BDB 4.7 for quite some time w/o issue.
Well, with 4.8.30, the problem is still evident. Overall, using delta-syncrepl seems to make a large difference. With my previous test (still with bdb 5.3) and delta-syncrepl, the slapd process grew to 16G before dying. With delta off (removing the 'syncdata=accesslog logbase="cn=accesslog" logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"' portion of the olcSyncrepl line) the node started and finished loading the entire DB and grew to about 5.7G in total.
The fact that switching off delta changes the behavior would point pretty squarely at a bug in the delta-sync consumer code, not anything in the underlying backend. If you can produce a small test case to demonstrate the problem, it would be a good idea to post this to the ITS.
Or it just means that the accesslog is receiving a significant number of changes and that the time configured for how often to delete the accesslog DB is too small. It will certainly grow without bound up to the time deletion kicks in. For some of our larger customers, I have delete run on all data older than 8 hours, every 2 hours, to keep the accesslog DB size down. I.e., there is still not necessarily evidence of a memory leak.
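In cn=config terms that kind of policy is set via the accesslog overlay's purge attribute, something like the following (the overlay and database indices are just an example):

dn: olcOverlay={0}accesslog,olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcAccessLogPurge
olcAccessLogPurge: 08:00 02:00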
--Quanah
On 05/23/12 06:42 PM, Quanah Gibson-Mount wrote:
Or it just means that the accesslog is receiving a significant number of changes and that the time configured for how often to delete the accesslog DB is too small. It will certainly grow
To the tune of 16+G of VM (only 4.7G resident) on a 512M accesslog DB and a 2G main DB, though? I've tried setting logpurge to delete entries older than 15 minutes every 30 minutes, and it doesn't seem to be making a difference.
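(In cn=config terms that's roughly olcAccessLogPurge: 00:15 00:30.)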
I'll try to see if I can generate a pile of entries not using our local schemas and reproduce the problem.
I suppose I can also try switching accesslog over to mdb and see if it still occurs.
--On May 24, 2012 2:45:25 PM -0300 Brandon Hume hume-ol@bofh.ca wrote:
On 05/23/12 06:42 PM, Quanah Gibson-Mount wrote:
Or it just means that the accesslog is receiving a significant number of changes and that the time configured for how often to delete the accesslog DB is too small. It will certainly grow
To the tune of 16+G of VM (only 4.7G resident) on a 512M accesslog DB and a 2G main DB, though? I've tried setting logpurge to delete entries older than 15 minutes every 30 minutes, and it doesn't seem to be making a difference.
Yeah, that sounds like a different issue then. ;)
--Quanah