Quanah,
Please see my comments in your previous e-mail.
Thanks,
Rodrigo.
Quanah Gibson-Mount wrote:
--On Thursday, August 27, 2009 6:39 AM -0700 Rodrigo Costa rlvcosta@yahoo.com wrote:
Quanah,
Please see answer in your previous e-mail below.
I'm also sending the information I could collect as an attachment, since it is a small file (5 KB).
The behavior that appears strange, and that could indicate a problem, is that even when the consumer is stopped, the provider keeps doing something for a long time. This doesn't appear to be correct.
Another strange behavior is that when the system enters this state, one provider CPU stays at around 100% usage. I made a JMeter script to test individual bind/search operations (no ldapsearch with "*"), and even under some load (around 200 simultaneous queries) I do not see the CPU at 100%. Something doesn't appear to be right, since I don't see why the CPU should stay at 100% permanently.
I explained to you previously why this would be. Other comments inline.
Why are you stopping the provider to do a slapcat?
[Rodrigo] Faster dump of the data. And in any case, if some other situation such as a problem occurs, the secondary system could stay disconnected for other reasons.
[Rodrigo] I have two reasons:
1) Since the backup takes some time and the DB has multiple branches for the same record, the only way to get a consistent backup is to run a cold backup;
2) slapcat with slapd stopped could perform faster, and also satisfies item 1 above (cold backup).
Do you have any evidence that an offline slapcat is faster than one while slapd is running? I don't understand what you mean in the rest of that sentence.
[Rodrigo] I didn't try it under load traffic, but it seems reasonable that a cold backup would be faster and cleaner than a hot backup.
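The cold-backup procedure Rodrigo describes might be sketched like this; the paths, the init-script name, and the backup location are assumptions for illustration, not taken from this thread:

```shell
# Cold backup sketch (assumed paths): stop slapd, dump the DB, restart.
/etc/init.d/slapd stop                                        # or: systemctl stop slapd
slapcat -f /etc/openldap/slapd.conf -l /var/backups/ldap-$(date +%F).ldif
/etc/init.d/slapd start
```

Note that slapcat on back-bdb/back-hdb can also be run safely while slapd is running, which is the point Quanah is raising.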
Even when only a small number of entries differ when the consumer on Provider 2 connects to Provider 1, syncrepl falls back to a full DB search, as expected.
What is your sessionlog setting on each provider for the syncprov overlay?
[Rodrigo]
syncprov-checkpoint 10000 120
syncprov-sessionlog 100000
Hm, I would probably checkpoint the cookie a lot more frequently than you have it set to. The sessionlog setting seems fine to me.
[Rodrigo] Ok
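A more frequent cookie checkpoint along the lines Quanah suggests might look like this; the exact numbers are an assumption, not from this thread. The two arguments to syncprov-checkpoint are write operations and minutes between contextCSN checkpoints:

```
# syncprov overlay settings (illustrative values for a more frequent checkpoint)
overlay syncprov
syncprov-checkpoint 1000 5      # checkpoint contextCSN every 1000 write ops or 5 minutes
syncprov-sessionlog 100000      # unchanged from the thread
```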
Same configuration in both systems.
For definition purposes, I have some memory limitations, so I need to limit dncachesize to around 80% of the DB entries.
We already went through other things you could do to reduce your memory footprint in other ways. You've completely ignored that advice. As long as your dncachesize is in this state, I don't expect things to behave normally.
[Rodrigo] I implemented what was possible. The resulting cache configuration, given the memory constraints, is:
# Cache values
# cachesize 10000
cachesize 20000
dncachesize 3000000
# dncachesize 400000
# idlcachesize 10000
idlcachesize 30000
# cachefree 10
cachefree 100
You don't say anything in here about your DB_CONFIG settings, which is where you could stand to gain the most memory back. I do see you're definitely running a very restricted cachesize/idlcachesize. ;)
[Rodrigo] DB_CONFIG is using only 100 MB of memory, plus DB_LOG_AUTOREMOVE.
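As an illustration of the DB_CONFIG Rodrigo describes (the cache size and flag are from the thread; the split into one segment and the log buffer size are assumptions), a BerkeleyDB environment file might look like:

```
# DB_CONFIG (BerkeleyDB environment settings; illustrative)
set_cachesize 0 104857600 1     # 100 MB BDB cache in a single segment
set_flags DB_LOG_AUTOREMOVE     # remove no-longer-needed transaction log files
set_lg_bsize 2097152            # 2 MB in-memory log buffer (assumed value)
```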
What value did you set for "cachefree"?
[Rodrigo] cachefree 100
[Rodrigo] I made the proposed change and tested it. The behavior was really better, since after dncachesize was filled, the issue did not repeat as before.
BUT it just took more time until the behavior repeated. After some more time, just after dncachesize reached around 3 million entries, the behavior returned. What happens is:
1) Provider 1 CPU starts to run at around 100%;
2) Consumer 2 CPU goes to 0% (before, it was around 10% while replication was in place);
3) Replication never ends (I cannot see the data in Provider 2), and even if I stop Consumer 2 (or slapd), the CPU on Provider 1 remains at 100% for days.
It looks like the code enters an endless loop whose trigger condition I could not identify, nor what is needed to avoid it. I generated some GDB traces and will put them on the FTP as soon as possible (when there is space).
This value is likely substantially too low for your system configuration. It controls how many entries get freed from each of the caches at a time. With your dncachesize at 3,000,000, removing 100 entries from it will do hardly anything, and may be part of the issue. If it weren't for the major imbalance between your entry, IDL, and DN cache sizes, I would suggest a fairly high value like 100,000. But given that your entry cache is 20,000, you'll probably have to limit cachefree to 5000-10000. In any case, it needs to be higher than 100.
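Putting the advice above together, the revised back-bdb cache stanza might read as follows; the cachefree value is one point in the suggested 5000-10000 range, picked for illustration only:

```
# slapd.conf back-bdb cache settings (sketch following the advice above)
cachesize    20000      # entry cache, unchanged
idlcachesize 30000      # IDL cache, unchanged
dncachesize  3000000    # DN cache, limited by available memory
cachefree    5000       # free entries in larger batches (was 100)
```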
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc
Zimbra :: the leader in open source messaging and collaboration