brandon.hume@dal.ca wrote:
Full_Name: Brandon Hume
Version: 2.4.31
OS: RHEL 6.1, kernel 2.6.32-131.12.1.el6.x86_64
URL: http://den.bofh.ca/~hume/ol-2.4.31_memleak.tar.gz
Submission from: (NULL) (2001:410:a010:2:223:aeff:fe74:400e)
OpenLDAP 2.4.31, compiled 64-bit with BerkeleyDB 5.3.15, appears to exhibit a memory leak while replicating the full database from another node in a multi-master replication (MMR) setup.
A two-node MMR configuration has been set up. Node 1 is fully populated with data, approximately 338k DNs, which occupies around 1G on disk (including the BDB __db.* and log.* files). When brought up on a 64-bit system, node 1's slapd occupies around 5.5G VM and 4.7G RSS.
Node 2 is initialized with a copy of cn=config (slapcat/slapadd method) and brought up with an empty database to begin replication. Over the course of the replication, node 2's slapd will grow continuously. On the one occasion it managed to "finish" the replication (with the test database), node 2's slapd occupied 14G VM and approximately 6G RSS.
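For context, a minimal sketch of the kind of two-node MMR wiring involved, expressed as cn=config LDIF; the hostnames, the {1}bdb database name, credentials, and rid values are illustrative assumptions, not taken from the attached kit:

    # Both nodes share this config; each node's serverID is selected
    # by matching its own listener URL
    dn: cn=config
    changetype: modify
    replace: olcServerID
    olcServerID: 1 ldap://node1.example.com
    olcServerID: 2 ldap://node2.example.com

    # Each node pulls from the other; olcMirrorMode allows writes on both
    dn: olcDatabase={1}bdb,cn=config
    changetype: modify
    add: olcSyncrepl
    olcSyncrepl: rid=001 provider=ldap://node1.example.com
      bindmethod=simple binddn="cn=replicator,dc=example,dc=com"
      credentials=secret searchbase="dc=example,dc=com"
      type=refreshAndPersist retry="5 +"
    olcSyncrepl: rid=002 provider=ldap://node2.example.com
      bindmethod=simple binddn="cn=replicator,dc=example,dc=com"
      credentials=secret searchbase="dc=example,dc=com"
      type=refreshAndPersist retry="5 +"
    -
    add: olcMirrorMode
    olcMirrorMode: TRUE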
I've included a link to the test kit I put together. It includes a fairly large, anonymized database, as well as a simplified copy of the configuration. I've left in the sendmail and misc schemas but removed irrelevant local schemas. Also included are the DB_CONFIGs used for the main database and accesslog, and the configuration scripts used for compiling both BDB and OpenLDAP.
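(For anyone reproducing without the tarball: a typical DB_CONFIG for a database this size might look roughly like the following. The sizes are guesses for illustration, not the values shipped in the kit.)

    # BDB environment tuning -- illustrative values only
    # 512MB cache in a single segment
    set_cachesize 0 536870912 1
    # 2MB in-memory log buffer, 10MB per log.* file
    set_lg_bsize 2097152
    set_lg_max 10485760
    # skip fsync on commit; fine for a disposable test environment
    set_flags DB_TXN_NOSYNC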
Steps to reproduce (a condensed shell transcript follows the list):
- Compile and install BDB and OpenLDAP with the same options as in the config-db.sh and config-ldap.sh scripts.
- Initialize the configuration on nodes 1 and 2 using "slapadd -F etc/slapd.d -b cn=config -l slapd-conf.ldif".
- Initialize the main DB on node 1 using "slapadd -l test_dit.ldif".
- Start node 1. The slapd process should stabilize at around 5G VM in use.
- Start node 2 and allow it to begin replication.
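Condensed into a shell transcript (hostnames and paths are placeholders; adjust to your layout):

    # both nodes: load the shared cn=config
    slapadd -F etc/slapd.d -b cn=config -l slapd-conf.ldif

    # node 1 only: load the full DIT
    slapadd -F etc/slapd.d -l test_dit.ldif

    # node 1: start and wait for VM use to level off (~5G)
    slapd -F etc/slapd.d -h ldap://node1.example.com

    # node 2: start with an empty main DB and watch slapd grow
    slapd -F etc/slapd.d -h ldap://node2.example.com
    watch -n 60 'ps -o vsz=,rss= -p $(pgrep slapd)'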
I've tested with node 2 on both RHEL 6 and Solaris 10. In both cases, node 2's slapd became extremely bloated over the course of several hours. Only the Solaris SPARC box was able to complete the replication, stabilizing at 14G VM used. The Red Hat x86 box continued growing far beyond the 16G swap limit and was killed by the OS.
I attempted to trace the leak with the Solaris libumem tools, running gcore on the live process and "::findleaks -dv" within mdb against the resulting core. The generated report is included as "mdb_findleaks_analysis.txt" in case it provides any useful information; disregard it if you wish.
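In outline, the procedure was as follows; the slapd arguments here are a sketch, and the relevant parts are the libumem environment variables and the mdb dcmd:

    # run slapd under libumem's debugging allocator (Solaris)
    LD_PRELOAD=libumem.so UMEM_DEBUG=default UMEM_LOGGING=transaction \
        slapd -F etc/slapd.d -h ldap://node2.example.com

    # once it has bloated, snapshot it and hunt for leaked buffers
    gcore $(pgrep slapd)        # writes core.<pid>
    mdb core.<pid>
    > ::findleaks -dv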
(I apologize for the large test LDIF. I wanted something that would definitively show the problem, so I didn't want to trim it too much...)
Thanks for the detailed report; your test revealed several bugs. The leaks are now fixed in git master.
There's still another issue: node 2 starts sending the received changes back to node 1, even though they originally came from node 1. This happens because most of your entries were created with sid=0, so syncprov doesn't know they actually originated from node 1 (sid=1). That wastes a lot of CPU and network traffic sending data that isn't needed, but it's a separate issue from the memory leaks.
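For illustration (timestamps invented): the originating server ID is the third field of each entryCSN, so the two cases look like this. slapadd's -S option can stamp a specific SID at load time, which would let syncprov attribute the entries correctly:

    # loaded by plain slapadd: origin SID is 000, so syncprov can't tell
    # these entries came from node 1
    entryCSN: 20120801120000.000000Z#000000#000#000000
    # written through node 1 (serverID 1): origin is recorded as sid=001
    entryCSN: 20120801120500.123456Z#000000#001#000000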