Quanah,
Please see my comments inline in your previous e-mail below.
Thanks,
Rodrigo.
Quanah Gibson-Mount wrote:
> --On Thursday, August 27, 2009 6:39 AM -0700 Rodrigo Costa
> <rlvcosta(a)yahoo.com> wrote:
>
>> Quanah,
>>
>> Please see my answers inline in your previous e-mail below.
>>
>> I'm also attaching the information I could collect, since it is a
>> small file (5 KB).
>>
>> The behavior that appears strange, and that could indicate a problem,
>> is that even when the consumer is stopped, the provider is still doing
>> something for a long time. This doesn't appear to be correct.
>>
>> Another strange behavior is that when the system enters this state, one
>> provider CPU stays at around 100% usage. I made a JMeter script to test
>> individual bind/search operations (not "ldapsearch *"), and even with
>> some load (around 200 simultaneous queries) I do not see the CPU at
>> 100%. Something doesn't appear to be right, since I do not see why the
>> CPU should stay at 100% permanently.
>
> I explained to you previously why this would be. Other comments inline.
>
>>> Why are you stopping the provider to do a slapcat?
>> [Rodrigo] Faster dump of data. And in any case, if another situation
>> such as a problem occurs, the secondary system could stay disconnected
>> for other reasons.
[Rodrigo] I have 2 reasons:
1) Since the backup takes some time and the DB has multiple branches for
the same record, the only way to have a consistent backup is to execute a
cold backup;
2) slapcat with slapd stopped could perform faster and also fulfills item
1 above (cold backup).
> Do you have any evidence that an offline slapcat is faster than one
> while slapd is running? I don't understand what you mean in the rest
> of that sentence.
[Rodrigo] I didn't try it with live traffic, but it seems reasonable that
a cold backup would be faster and cleaner than a hot backup.
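For reference, the cold backup sequence I have in mind is roughly the
following (a minimal sketch; the init script path and output file name
are illustrative, not my exact setup):

  # stop slapd so the dump is consistent (cold backup)
  /etc/init.d/slapd stop
  # dump the whole DB to LDIF
  slapcat -f /etc/openldap/slapd.conf -l /backup/provider1.ldif
  /etc/init.d/slapd start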
>>>> Even when only a small number of entries differ, when the consumer
>>>> on Provider 2 connects to Provider 1, syncrepl enters the full DB
>>>> search, as expected.
>>>
>>>
>>> What is your sessionlog setting on each provider for the syncprov
>>> overlay?
>> [Rodrigo]
>> syncprov-checkpoint 10000 120
>> syncprov-sessionlog 100000
>
> Hm, I would probably checkpoint the cookie a lot more frequently than
> you have it set to. The sessionlog setting seems fine to me.
[Rodrigo] Ok
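Something like the following is what I will try for a more frequent
cookie checkpoint (the values are illustrative; the directive syntax is
syncprov-checkpoint <ops> <minutes>):

  syncprov-checkpoint 1000 10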
>
>> The same configuration is used on both systems.
>>>
>>>> For definition purposes, I have some memory limitations where I need
>>>> to limit dncachesize to around 80% of the DB entries.
>>>
>>> We already went through other things you could do to reduce your
>>> memory footprint in other ways. You've completely ignored that
>>> advice. As long as your dncachesize is in this state, I don't expect
>>> things to behave normally.
>> [Rodrigo] I implemented what was possible. The result is the cache
>> config below, the best possible given the memory constraints:
>> # Cache values
>> # cachesize 10000
>> cachesize 20000
>> dncachesize 3000000
>> # dncachesize 400000
>> # idlcachesize 10000
>> idlcachesize 30000
>> # cachefree 10
>> cachefree 100
>
> You don't say anything in here about your DB_CONFIG settings, which is
> where you could stand to gain the most amount of memory back. I do
> see you're definitely running a very restricted
> cachesize/idlcachesize. ;)
[Rodrigo] DB_CONFIG is using only 100 MB of memory and DB_LOG_AUTOREMOVE.
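For completeness, the relevant DB_CONFIG lines are roughly these (a
sketch of my settings; the 100 MB is the BDB cache set via
set_cachesize):

  # 0 GB + 104857600 bytes (100 MB), in 1 cache segment
  set_cachesize 0 104857600 1
  # remove transaction log files as soon as they are no longer needed
  set_flags DB_LOG_AUTOREMOVE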
>
>
>>> What value did you set for "cachefree"?
>> [Rodrigo] cachefree 100
[Rodrigo] I made the proposed change and tested it. The behavior was
much better, since after dncachesize was filled the issue did not repeat
as it had before.
BUT it just took more time until the behavior repeated. After some more
time, just after dncachesize reached around 3 million entries, the
behavior returned. What happens is:
1 -> Provider 1 CPU starts to run at around 100%;
2 -> Consumer 2 CPU goes to 0% consumption (before, it was around 10%
while replication was in place);
3 -> Replication never ends (I cannot see the data in Provider 2), and
even if I stop Consumer 2 (or slapd), the CPU on Provider 1 remains at
100% for days.
It looks like the code enters a dead loop, and I could not identify the
condition that triggers it or what is required to avoid it. I generated
some GDB traces, and as soon as possible (when there is space) I will put
them on the FTP site.
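The traces were collected along these lines (an illustrative gdb
invocation, not the exact session):

  # attach to the running slapd and dump all thread backtraces
  gdb -p `pidof slapd`
  (gdb) thread apply all bt full
  (gdb) detach
  (gdb) quit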
>
> This value is likely substantially too low for your system
> configuration. This is how many entries get freed from any of the
> caches. With your dncachesize being 3,000,000, removing 100 entries
> from it will do hardly anything, and may be part of the issue. If it
> wasn't for the major imbalance between your entry, idl, and
> dncachesizes, I would suggest a fairly high value like 100,000. But
> given your entry cache is 20,000, you'll probably have to limit the
> cachefree to 5000-10000. But it is going to need to be higher than 100.
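[Rodrigo] For concreteness, applying that suggestion to my current
stanza would look roughly like this (the cachefree value is illustrative,
picked from the 5000-10000 range suggested above; not yet tested):

  cachesize 20000
  dncachesize 3000000
  idlcachesize 30000
  cachefree 5000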
>
> --Quanah
>
> --
>
> Quanah Gibson-Mount
> Principal Software Engineer
> Zimbra, Inc
> --------------------
> Zimbra :: the leader in open source messaging and collaboration
>
>