syncrepl replication taking too long (not in sync) - openldap-software

18 Aug 2009


OpenLDAP software community,
I'm having difficulty keeping my databases synchronized with syncrepl.
I'm running the latest OpenLDAP release, 2.4.17, which I rebuilt for
debugging with gdb after running into these issues.
I have a DB (really split into 2 DBs), each holding around 4 million
entries. Because of memory limitations I have dncachesize set to around
3000000, i.e. smaller than the number of entries in each DB.
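To make the cache sizing concrete, here is a small sketch of the
arithmetic in Python. The figures are the approximate ones quoted above,
and the function name is only illustrative, not anything from OpenLDAP.

```python
def dn_cache_coverage(dncachesize: int, entry_count: int) -> float:
    """Fraction of entry DNs that fit in the DN cache at once (capped at 1.0)."""
    if entry_count <= 0:
        raise ValueError("entry_count must be positive")
    return min(1.0, dncachesize / entry_count)


# Approximate values from this setup: ~3 million cache slots, ~4 million entries.
coverage = dn_cache_coverage(3_000_000, 4_000_000)
print(f"DN cache can hold about {coverage:.0%} of the DNs in one database")
# prints: DN cache can hold about 75% of the DNs in one database
```

So a search that walks the whole database touches more DNs than the
cache can hold at once. That is at least consistent with the behaviour
changing once the cache fills, although it does not prove the cause.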
I loaded both servers with all indexes and the same data. When both are
started there is no need for syncrepl (a slapd thread) to run any
search, so both mirrors are in sync and consuming from each other. If a
new entry is created, the other mirror consumes it right away, since
both are listening when it happens.
If I stop one mirror and create even a small number of entries on the
other, say 10, then when I start the stopped provider again syncrepl
falls back to conventional syncrepl replication, which searches the DB
in order to synchronize.
This never finishes, leaving the mirrors out of sync. What I can see is:
1) Stop the Second mirror, e.g. for slapcat (calling them Second and
First for reference);
2) Add a few entries on the First mirror (kept online);
3) Start the Second mirror again after the First mirror has had some new
entries added by normal operation;
4) Syncrepl on the Second mirror enters conventional syncrepl
replication, since it detects that something differs between the mirrors;
5) Until the dncache fills up, slapd CPU consumption on the First mirror
stays below 100% (around 50%) and the search progresses well, as the
monitor shows;
6) After the dncache is filled (it oscillates above 3 million), CPU
consumption on the First mirror goes to 100%, oscillating between 98%
and 102%;
7) The search never ends, so the systems are never in sync. CPU stays
permanently high, almost always at 100%.
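One way to check whether the Second mirror has caught up is to compare
the contextCSN values stored on each mirror's suffix entry (readable
with a base-scope search such as ldapsearch -s base ... contextCSN). A
minimal sketch in Python, assuming the CSN strings have already been
fetched and use the 2.4 layout timestamp#count#sid#mod; the sample
values below are made up for illustration and the function names are
hypothetical.

```python
def parse_csn(csn: str):
    """Split a CSN into (server id, comparable stamp)."""
    timestamp, count, sid, mod = csn.split("#")
    # The timestamp is fixed-width, so plain string comparison orders it correctly.
    return int(sid, 16), (timestamp, int(count, 16), int(mod, 16))


def lagging_sids(local_csns, remote_csns):
    """Server IDs for which the remote mirror holds a newer CSN than the local one."""
    local = dict(parse_csn(c) for c in local_csns)
    remote = dict(parse_csn(c) for c in remote_csns)
    oldest = ("", 0, 0)  # sorts before any real stamp, used when a SID is missing
    return sorted(sid for sid, stamp in remote.items()
                  if local.get(sid, oldest) < stamp)


# Made-up sample values: First received writes while Second was stopped.
first = ["20090818120500.000000Z#000000#001#000000",
         "20090818110000.000000Z#000000#002#000000"]
second = ["20090818120000.000000Z#000000#001#000000",
          "20090818110000.000000Z#000000#002#000000"]

print(lagging_sids(second, first))  # SIDs where Second is behind First
```

An empty list in both directions would indicate that the two mirrors
hold the same state; a non-empty one shows which server's changes are
still missing.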
I let this process run for days and saw only one or two entries get
synchronized. Judging by the CPU, something seems to be hanging the
search: some loop keeps the thread consuming one full CPU.
I collected some GDB information, which I'm sending attached. I'm not
sure how to interpret this overlay_walk.
The idea is to be able to stop one mirror for backup, relieving the
primary server of that task. For this to work, replication needs to
catch up afterwards.
Your comments are very welcome.
Regards,
Rodrigo.