Re: Troubleshooting synchronization

11 Nov 2009


      On Wed, Nov 11, 2009 at 3:01 PM, Howard Chu hyc@symas.com wrote:
...
Edward Capriolo wrote:
...
On Thu, Nov 5, 2009 at 5:25 AM, Torsten Schlabach (Tascel eG)
tschlabach@tascel.net wrote:
...
Hi Quanah!
...
I suggest you go read the CHANGES log for what has been fixed between
2.4.11 and the latest stable 2.4.19.
I need to say, it worries me a bit that for problems with a core feature
which has been around for quite some time, the answer is more often than
I like to hear: You need to use the latest version released last week /
month or so.
I have indeed read the CHANGES and seen that some issues have been
fixed. I have no idea if we are affected by those issues or not.
Also how would I know that *now* in 2.4.19 all problems are fixed and
the answer next week won't be: You need to use 2.4.20.
But as this is a FOSS project and not a product we pay for, we
understand that we should not blame people but try and help if we find a
problem.
For that reason I have asked in my email for help on *understanding* and
*diagnosing* problems to have a chance to contribute in case we will
find any new issues.
Also our customers may not like it if in case of a problem we tell them:
Let's wait if in some weeks a new release will come which will fix it or
not. So I'd rather be in a position to get my hands dirty myself in case
of problems.
Regards,
Torsten
Quanah Gibson-Mount schrieb:
...
--On Wednesday, November 04, 2009 1:12 PM +0100 "Torsten Schlabach
(Tascel eG)" tschlabach@tascel.net wrote:
...
Hi all!
I am currently trying to chase some problems in an n-way multi-master
setup with three servers. We have used the instructions at
http://www.openldap.org/doc/admin24/replication.html#N-Way%20Multi-Master
as our guidance and we are using OpenLDAP version 2.4.11.
I suggest you go read the CHANGES log for what has been fixed between
2.4.11 and the latest stable 2.4.19.
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc

Zimbra ::  the leader in open source messaging and collaboration
...
...
Also how would I know that *now* in 2.4.19 all problems are fixed and
the answer next week won't be: You need to use 2.4.20.
Testing reveals the presence of bugs, not the absence :)  So no one
can ever say version x.y.z is certified bug free.
However, I do tend to agree, in that my MM just flaked out, and there
is not much load/write/update going on so I am a bit worried.
I am not trying to put down OpenLDAP but iplanet/fedora directory
server/389 support up to a 4 way MM implementation and I have found
the replication rock solid even under high load. So if MM is your
requirement that may be a more valid option.
The historical evidence disagrees with your assertion. Even at this late date,
FDS MMR still breaks irrecoverably.
https://www.redhat.com/archives/fedora-directory-users/2009-November/msg0005...
How many years have they been flogging this feature? They still haven't got it
right. They can't.
MMR is inherently flawed, as we have been saying for years.
http://www.watersprings.org/pub/id/draft-zeilenga-ldup-harmful-02.txt
We have implemented it in OpenLDAP mainly for political reasons, not because
we changed our minds and now believe it to be technically sound. It is not. We
developed and recommend MirrorMode because the only safe way to do replication
is by preserving single-master consistency.
...
...
...
The answer is quite simple: do not use multimaster replication in a
production environment. In most cases the requirement for multimaster
replication is just based on poor directory design.
Dieter, I do not agree with that. You can't blame a user for using a
feature. It is not marked as experimental anymore so people are going
to use it. Once it fails you can't call them a "Poor Directory
Designer" for using it.
http://www.openldap.org/faq/data/cache/1240.html
If they have implemented MMR without reading all of the warnings, they are
certainly poor designers for not becoming fully informed of the topic before
deploying it. If they have implemented MMR after reading all of the warnings,
they made a conscious choice.
--
 -- Howard Chu
 CTO, Symas Corp.           http://www.symas.com
 Director, Highland Sun     http://highlandsun.com/hyc/
 Chief Architect, OpenLDAP  http://www.openldap.org/project/
I understand that OpenLDAP does not do distributed locking; as a
result I do not expect it to have ACID compliance.
Fedora Directory Server/389 has a "last update wins" policy, so this
is a much more optimistic strategy, but it works (for what I was
doing).
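
To make the "last update wins" policy mentioned above concrete, here is a minimal sketch of that style of conflict resolution. It is not how Fedora Directory Server/389 or OpenLDAP implements replication; the `Write` record, the `server_id` tie-breaker and the sample values are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """One attribute write; all fields are hypothetical, for illustration."""
    value: str
    timestamp: float  # when the origin server accepted the write
    server_id: int    # assumed tie-breaker for identical timestamps


def resolve(a: Write, b: Write) -> Write:
    """Return the write that survives under last-update-wins."""
    return max(a, b, key=lambda w: (w.timestamp, w.server_id))


def merge(local: dict, remote: dict) -> dict:
    """Merge two replicas attribute by attribute; the newer write wins."""
    merged = dict(local)
    for attr, write in remote.items():
        merged[attr] = resolve(merged[attr], write) if attr in merged else write
    return merged


# Two masters accept conflicting writes to the same attribute.
node_a = {"mail": Write("old@example.com", 100.0, 1)}
node_b = {
    "mail": Write("new@example.com", 105.0, 2),
    "cn": Write("Test User", 90.0, 2),
}

result = merge(node_a, node_b)
print(result["mail"].value)
```

Both replicas converge on the same state regardless of merge order, but the earlier write is silently discarded, which is the trade-off being argued about in this thread.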
Since I have joined this mailing list after my problems started, about
a month ago, I have seen at least 4 other threads with similar issues.
http://www.openldap.org/lists/openldap-software/200911/msg00015.html
http://www.openldap.org/lists/openldap-software/200911/msg00021.html
...
An upgrade to 2.4.19 is suggested as a resolution, and I found another
thread with a bigger problem in that version.
As to the link you have posted:
https://www.redhat.com/archives/fedora-directory-users/2009-November/msg0005...
It is very easy to quickly search a mailing list and find some people
having problems with software. That does not prove FDS has many MM
problems. I personally ran a two-node FDS instance with very active
WRITE/UPDATE for two years and had only a few isolated problems.
...
...
If they have implemented MMR without reading all of the warnings,
they are certainly poor designers for not becoming fully informed of the topic before deploying it.
...
From my perspective, I find the reliability of M-M OpenLDAP on 2.4.16
brittle. I am not the only one having problems. Your comment seems to
suggest I did not read enough. I would upgrade to 2.4.19 but someone
else on this list is having problems with that so that does not seem
like a safe option.
Since I installed OpenLDAP on two lightly trafficked nodes:
1) One node locked up
2) After lockup/restart the nodes did not re-establish two way
replication connection
3) I have out of sync data (which I do not believe was added during
the downtime caused by 1)
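
One way to start diagnosing point 3 is to compare the replication state each node reports. The sketch below assumes the `contextCSN` values of the suffix entry have already been collected from each server (for example with a base-scope `ldapsearch` requesting `contextCSN`) and that they follow the `timestamp#count#sid#mod` layout used by OpenLDAP 2.4. The server names and CSN values are invented for illustration.

```python
def parse_csn(csn: str) -> tuple:
    """Split a CSN of the form timestamp#count#sid#mod into its parts."""
    timestamp, count, sid, mod = csn.split("#")
    return timestamp, int(count, 16), int(sid, 16), int(mod, 16)


def latest_by_sid(csns) -> dict:
    """Map each server ID (SID) to the newest CSN key seen for it."""
    latest = {}
    for csn in csns:
        timestamp, count, sid, mod = parse_csn(csn)
        key = (timestamp, count, mod)  # fixed-width timestamp sorts as text
        if sid not in latest or key > latest[sid]:
            latest[sid] = key
    return latest


def find_lagging(servers: dict) -> dict:
    """Map server name -> SIDs for which it is behind the newest CSN."""
    per_server = {name: latest_by_sid(c) for name, c in servers.items()}
    newest = {}
    for state in per_server.values():
        for sid, key in state.items():
            if sid not in newest or key > newest[sid]:
                newest[sid] = key
    lagging = {}
    for name, state in per_server.items():
        behind = sorted(s for s, key in newest.items() if state.get(s) != key)
        if behind:
            lagging[name] = behind
    return lagging


# Hypothetical contextCSN values collected from two masters.
servers = {
    "ldap1": [
        "20091111150100.000000Z#000000#001#000000",
        "20091111150200.000000Z#000000#002#000000",
    ],
    "ldap2": [
        "20091111150100.000000Z#000000#001#000000",
        "20091111145900.000000Z#000000#002#000000",
    ],
}

print(find_lagging(servers))
```

A server that stays in the result across repeated checks, with no writes in flight, is a candidate for the kind of stalled replication described in point 2.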
Linking to an RFC and implying that I "Don't read enough" is wrong. If
my light usage is bringing to light obvious bugs and I am not the only
one having these issues, not enough testing on the software
development side is being done.
As an administrator I ran 'make test' and watched
test050-syncrepl-multimaster complete. That coupled with the fact that
multi-master is no longer being labeled as an "experimental" feature
led me to believe it worked reasonably well.
The RFC makes no mention of my #2 problem 'After lockup/restart the
nodes did not re-establish two way replication connection'. Is that
supposed to be the fault of the user? This is obviously a bug or an
edge case. This is not the fault of a user not reading enough. Which
is where the frustration is I think. People are willing to accept the
failure cases covered in the RFC, but the RFC is not a blanket
statement "WE told you not to run this" for every bug that appears.
