On Wed, Nov 11, 2009 at 3:01 PM, Howard Chu hyc@symas.com wrote:
Edward Capriolo wrote:
On Thu, Nov 5, 2009 at 5:25 AM, Torsten Schlabach (Tascel eG) tschlabach@tascel.net wrote:
Hi Quanah!
I suggest you go read the CHANGES log for what has been fixed between 2.4.11 and the latest stable 2.4.19.
I have to say, it worries me a bit that for problems with a core feature which has been around for quite some time, the answer is, more often than I like to hear: You need to use the latest version released last week / month or so.
I have indeed read the CHANGES and seen that some issues have been fixed. I have no idea if we are affected by those issues or not.
Also how would I know that *now* in 2.4.19 all problems are fixed and the answer next week won't be: You need to use 2.4.20.
But as this is a FOSS project and not a product we pay for, we understand that we should not blame people but try and help if we find a problem.
For that reason I have asked in my email for help on *understanding* and *diagnosing* problems to have a chance to contribute in case we will find any new issues.
Also, our customers may not like it if, in case of a problem, we tell them: Let's wait and see whether a new release in a few weeks fixes it or not. So I'd rather be in a position to get my hands dirty myself in case of problems.
Regards, Torsten
Quanah Gibson-Mount schrieb:
--On Wednesday, November 04, 2009 1:12 PM +0100 "Torsten Schlabach (Tascel eG)" tschlabach@tascel.net wrote:
Hi all!
I am currently trying to chase some problems in an n-way multi-master setup with three servers. We have used the instructions at
http://www.openldap.org/doc/admin24/replication.html#N-Way%20Multi-Master
as our guidance and we are using OpenLDAP version 2.4.11.
I suggest you go read the CHANGES log for what has been fixed between 2.4.11 and the latest stable 2.4.19.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc
Zimbra :: the leader in open source messaging and collaboration
Also how would I know that *now* in 2.4.19 all problems are fixed and the answer next week won't be: You need to use 2.4.20.
Testing reveals the presence of bugs, not the absence :) So no one can ever say version x.y.z is certified bug free.
However, I do tend to agree, in that my MM just flaked out, and there is not much load/write/update going on so I am a bit worried.
I am not trying to put down OpenLDAP, but iPlanet/Fedora Directory Server/389 support up to a 4-way MM implementation, and I have found the replication rock solid even under high load. So if MM is your requirement, that may be a more valid option.
The historical evidence disagrees with your assertion. Even at this late date, FDS MMR still breaks irrecoverably.
https://www.redhat.com/archives/fedora-directory-users/2009-November/msg0005...
How many years have they been flogging this feature? They still haven't got it right. They can't.
MMR is inherently flawed, as we have been saying for years.
http://www.watersprings.org/pub/id/draft-zeilenga-ldup-harmful-02.txt
We have implemented it in OpenLDAP mainly for political reasons, not because we changed our minds and now believe it to be technically sound. It is not. We developed and recommend MirrorMode because the only safe way to do replication is by preserving single-master consistency.
The answer is quite simple: do not use multimaster replication in a production environment. In most cases the requirement for multimaster replication is just based on poor directory design.
Dieter, I do not agree with that. You can't blame a user for using a feature. It is not marked as experimental anymore so people are going to use it. Once it fails you can't call them a "Poor Directory Designer" for using it.
If they have implemented MMR without reading all of the warnings, they are certainly poor designers for not becoming fully informed of the topic before deploying it. If they have implemented MMR after reading all of the warnings, they made a conscious choice.
-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
I understand that OpenLDAP does not do distributed locking; as a result, I do not expect it to have ACID compliance.
Fedora Directory Server/389 has a "last update wins" policy, so this is a much more optimistic strategy, but it works (for what I was doing).
Since I joined this mailing list after my problems started, about a month ago, I have seen at least 4 other threads with similar issues.
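To make the "last update wins" idea concrete, here is a minimal sketch of that kind of conflict resolution. It is not taken from the FDS/389 code; the Write record, its fields, and the replica-id tie-breaker are all hypothetical and only illustrate the general policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """Hypothetical record of one update to one attribute."""
    value: str
    timestamp: float  # time of the write at the originating master (assumed)
    replica_id: int   # assumed tie-breaker when timestamps are equal


def resolve(a: Write, b: Write) -> Write:
    """Pick the surviving write: newest timestamp, then highest replica id."""
    return max(a, b, key=lambda w: (w.timestamp, w.replica_id))


# Two masters accept conflicting writes to the same attribute.
from_master_1 = Write("alice@example.com", timestamp=100.0, replica_id=1)
from_master_2 = Write("alice@example.org", timestamp=101.5, replica_id=2)
winner = resolve(from_master_1, from_master_2)
```

Note that the losing write is dropped without any error being reported to the client that made it, which is why I call the strategy optimistic.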
http://www.openldap.org/lists/openldap-software/200911/msg00015.html http://www.openldap.org/lists/openldap-software/200911/msg00021.html ...
An upgrade to 2.4.19 is the suggested resolution, but I found another thread reporting a bigger problem in that version.
As to the link you have posted: https://www.redhat.com/archives/fedora-directory-users/2009-November/msg0005...
It is very easy to quickly search a mailing list and find some people having problems with software. That does not prove FDS has many MM problems. I personally ran a two-node FDS instance with very active WRITE/UPDATE for two years and had only a few isolated problems.
If they have implemented MMR without reading all of the warnings, they are certainly poor designers for not becoming fully informed of the topic before deploying it.
From my perspective, I find the reliability of M-M OpenLDAP on 2.4.16 brittle. I am not the only one having problems. Your comment seems to suggest I did not read enough. I would upgrade to 2.4.19, but someone else on this list is having problems with that, so that does not seem like a safe option.
Since I installed OpenLDAP on two lightly trafficked nodes:
1) One node locked up.
2) After the lockup/restart, the nodes did not re-establish the two-way replication connection.
3) I have out-of-sync data (which I do not believe was added during the downtime caused by 1).
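For anyone wanting to check symptom 3 on their own setup, one rough approach is to compare the contextCSN values each node reports for the suffix. The sketch below assumes the values have already been read from each server (for example with a base-scope search on the suffix asking for the contextCSN attribute) and that they follow the usual timestamp#count#sid#mod layout. The function names and sample values are made up for illustration.

```python
def parse_csn(csn):
    """Split a CSN string into its server id and a comparable key."""
    timestamp, count, sid, mod = csn.split("#")
    # The fields are fixed width, so plain string comparison orders them.
    return sid, (timestamp, count, mod)


def latest_by_sid(csns):
    """Map each server id to the newest CSN key seen for it."""
    latest = {}
    for csn in csns:
        sid, key = parse_csn(csn)
        if sid not in latest or key > latest[sid]:
            latest[sid] = key
    return latest


def diverging_sids(node_a, node_b):
    """Return the server ids whose newest CSN differs between two nodes."""
    a, b = latest_by_sid(node_a), latest_by_sid(node_b)
    return sorted(s for s in a.keys() | b.keys() if a.get(s) != b.get(s))


# Made-up sample values: node B has not seen the latest change from SID 002.
node_a = ["20091111150100.000000Z#000000#001#000000",
          "20091111150200.000000Z#000000#002#000000"]
node_b = ["20091111150100.000000Z#000000#001#000000",
          "20091111150000.000000Z#000000#002#000000"]
behind = diverging_sids(node_a, node_b)
```

Matching values do not prove the data is identical, but a difference that persists while both nodes are idle is a hint that replication in one direction has stopped.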
Linking to an RFC and implying that I "Don't read enough" is wrong. If my light usage is bringing to light obvious bugs and I am not the only one having these issues, not enough testing on the software development side is being done.
As an administrator, I ran 'make test' and watched test050-syncrepl-multimaster complete. That, coupled with the fact that multi-master is no longer labeled as an "experimental" feature, led me to believe it worked reasonably well.
The RFC makes no mention of my #2 problem, 'After lockup/restart the nodes did not re-establish two way replication connection'. Is that supposed to be the fault of the user? This is obviously a bug or an edge case. This is not the fault of a user not reading enough. That is where the frustration lies, I think. People are willing to accept the failure cases covered in the RFC, but the RFC is not a blanket statement of "We told you not to run this" for every bug that appears.