brandon.hume@dal.ca wrote:
Full_Name: Brandon Hume
Version: 2.4.31
OS: RHEL 6.1, kernel 2.6.32-131.12.1.el6.x86_64
URL: http://den.bofh.ca/~hume/ol-2.4.31_memleak.tar.gz
Submission from: (NULL) (2001:410:a010:2:223:aeff:fe74:400e)
OpenLDAP 2.4.31, compiled 64-bit with BerkeleyDB 5.3.15, appears to exhibit a memory leak while replicating the full database from another node in a multi-master replication (MMR) setup.
A two-node MMR configuration has been set up. Node 1 is fully populated with data, approximately 338k DNs, which occupies around 1G on disk (including the BDB __db.* and log.* files). When brought up on a 64-bit system, node 1's slapd occupies around 5.5G VM and 4.7G RSS.
Node 2 is initialized with a copy of cn=config (slapcat/slapadd method) and brought up with an empty database to begin replication. Over the course of the replication, node 2's slapd will grow continuously. On the one occasion it managed to "finish" the replication (with the test database), node 2's slapd occupied 14G VM and approximately 6G RSS.
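For context, a minimal sketch of the kind of two-node MMR wiring involved, expressed as cn=config LDIF; the hostnames, the {1}bdb database name, credentials, and rid values are illustrative assumptions, not taken from the attached kit:

    # Both nodes share this config; each node's serverID is selected
    # by matching its own listener URL
    dn: cn=config
    changetype: modify
    replace: olcServerID
    olcServerID: 1 ldap://node1.example.com
    olcServerID: 2 ldap://node2.example.com

    # Each node pulls from the other; olcMirrorMode allows writes on both
    dn: olcDatabase={1}bdb,cn=config
    changetype: modify
    add: olcSyncrepl
    olcSyncrepl: rid=001 provider=ldap://node1.example.com
      bindmethod=simple binddn="cn=replicator,dc=example,dc=com"
      credentials=secret searchbase="dc=example,dc=com"
      type=refreshAndPersist retry="5 +"
    olcSyncrepl: rid=002 provider=ldap://node2.example.com
      bindmethod=simple binddn="cn=replicator,dc=example,dc=com"
      credentials=secret searchbase="dc=example,dc=com"
      type=refreshAndPersist retry="5 +"
    -
    add: olcMirrorMode
    olcMirrorMode: TRUE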
I've included a link to the test kit I put together. It includes a fairly large, anonymized database, as well as a simplified copy of the configuration. I've left in the sendmail and misc schemas but removed irrelevant local schemas. Also included are the DB_CONFIGs used for the main database and accesslog, and the configuration scripts used for compiling both BDB and OpenLDAP.
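(For anyone reproducing without the tarball: a typical DB_CONFIG for a database this size might look roughly like the following. The sizes are guesses for illustration, not the values shipped in the kit.)

    # BDB environment tuning -- illustrative values only
    # 512MB cache in a single segment
    set_cachesize 0 536870912 1
    # 2MB in-memory log buffer, 10MB per log.* file
    set_lg_bsize 2097152
    set_lg_max 10485760
    # skip fsync on commit; fine for a disposable test environment
    set_flags DB_TXN_NOSYNC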
Steps to reproduce (a condensed shell transcript follows the list):
- Compile and install BDB and OpenLDAP with the same options as in the config-db.sh and config-ldap.sh scripts.
- Initialize the configuration on nodes 1 and 2 using "slapadd -F etc/slapd.d -b cn=config -l slapd-conf.ldif".
- Initialize the main DB on node 1 using "slapadd -l test_dit.ldif".
- Start node 1. The slapd process should stabilize at around 5G VM in use.
- Start node 2 and allow it to begin replication.
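Condensed into a shell transcript (hostnames and paths are placeholders; adjust to your layout):

    # both nodes: load the shared cn=config
    slapadd -F etc/slapd.d -b cn=config -l slapd-conf.ldif

    # node 1 only: load the full DIT
    slapadd -F etc/slapd.d -l test_dit.ldif

    # node 1: start and wait for VM use to level off (~5G)
    slapd -F etc/slapd.d -h ldap://node1.example.com

    # node 2: start with an empty main DB and watch slapd grow
    slapd -F etc/slapd.d -h ldap://node2.example.com
    watch -n 60 'ps -o vsz=,rss= -p $(pgrep slapd)'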
I've tested with node 2 on both RHEL 6 and Solaris 10. In both cases, node 2's slapd became extremely bloated over the course of several hours. Only the Solaris SPARC box was able to complete the replication, stabilizing at 14G VM used. The Red Hat x86 box continued growing far beyond the 16G swap limit and was killed by the OS.
I attempted to trace the leak with the Solaris libumem tools, running gcore on the live process and "::findleaks -dv" within mdb against the resulting core. The generated report is included as "mdb_findleaks_analysis.txt" in case it provides any useful information; disregard it if you wish.
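In outline, the procedure was as follows; the slapd arguments here are a sketch, and the relevant parts are the libumem environment variables and the mdb dcmd:

    # run slapd under libumem's debugging allocator (Solaris)
    LD_PRELOAD=libumem.so UMEM_DEBUG=default UMEM_LOGGING=transaction \
        slapd -F etc/slapd.d -h ldap://node2.example.com

    # once it has bloated, snapshot it and hunt for leaked buffers
    gcore $(pgrep slapd)        # writes core.<pid>
    mdb core.<pid>
    > ::findleaks -dv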
(I apologize for the large test LDIF. I wanted something that would definitively show the problem, so I didn't want to trim it too much...)
Thanks for the detailed report; your test revealed several bugs. The leaks are now fixed in git master.
There's still another issue: node 2 starts sending the received changes back to node 1, even though they originally came from node 1. This happens because most of your entries were created with sid=0, so syncprov doesn't know they actually originated from node 1 (sid=1). That wastes a lot of CPU and network traffic sending data that isn't needed, but it's a separate issue from the memory leaks.
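For illustration (timestamps invented): the originating server ID is the third field of each entryCSN, so the two cases look like this. slapadd's -S option can stamp a specific SID at load time, which would let syncprov attribute the entries correctly:

    # loaded by plain slapadd: origin SID is 000, so syncprov can't tell
    # these entries came from node 1
    entryCSN: 20120801120000.000000Z#000000#000#000000
    # written through node 1 (serverID 1): origin is recorded as sid=001
    entryCSN: 20120801120500.123456Z#000000#001#000000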