Hi all!
I am currently trying to chase some problems in an n-way multi-master setup with three servers. We have used the instructions at
http://www.openldap.org/doc/admin24/replication.html#N-Way%20Multi-Master
as our guidance and we are using OpenLDAP version 2.4.11.
The result we see currently is that replication works only partially, with some strange errors here and there.
As I believe it would be pointless to post all our cn=config LDIF here and explain which scenarios work and which don't, I thought it would be more productive to double-check that I have correctly understood what I *should* be seeing on my systems and how I can properly monitor it. My problem may be that I still need to learn how to properly monitor my slapd.
To begin with, I would just ask for confirmation of my proper understanding of the documentation:
1. A master server is a server which is using the syncprov overlay (servers/slapd/overlays/syncprov.c). This overlay will do little more than provide a synchronization cookie (CSN) which consumers may ask for to find out what needs to be replicated and what does not.
2. A consumer server is a server in which an additional thread is running which will query the master(s) at a given interval to ask for updates and, if there are any, get them over the wire and into the local copy of the database. This synchronization thread is servers/slapd/syncrepl.c, I guess?
3. An N-Way Multi-Master setup is a setup in which N servers are each a master and each of them is also a consumer of all the other masters?
Am I right up to here?
So what I fail to understand is:
1. What is the difference between Mirror Mode and N-Way Multi-Master? Especially given that in N-Way Multi-Master, I have to set olcMirrorMode to TRUE.
2. Given that I have added a 'Sync' value to the olcLogLevel attribute, what "health check" information should I be watching for in the log to see that replication is attempted as expected?
3. What problems should I be watching for in the logs?
4. Could I, for example, manually ask a master (using some ldapsearch statement, pretending I was the consumer) which entries it thinks I would have to update?
Regards, Torsten
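One way to approach the "health check" question is to compare the contextCSN values that each server stores on its suffix entry; they can be read with a base-scope search for the contextCSN attribute. The following is only a minimal sketch, not an official tool: the host names and CSN values are made up, and the CSN layout (timestamp#count#serverID#mod, with hexadecimal fields) is assumed to be the one used by 2.4. It reports which servers are behind for which serverID.

```python
from datetime import datetime, timezone


def parse_csn(csn):
    """Split a CSN such as '20091104121200.000000Z#000000#001#000000'."""
    stamp, count, sid, mod = csn.split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    return {"time": when, "count": int(count, 16), "sid": int(sid, 16), "mod": int(mod, 16)}


def newest_per_sid(csns):
    """Map each serverID to the newest CSN string seen for it."""
    newest = {}
    for csn in csns:
        sid = parse_csn(csn)["sid"]
        # CSNs with the same fixed-width layout sort chronologically as text.
        if sid not in newest or csn > newest[sid]:
            newest[sid] = csn
    return newest


def lagging_servers(context_csns):
    """Return {server: [serverIDs it is behind on]} for servers that are not up to date."""
    views = {server: newest_per_sid(values) for server, values in context_csns.items()}
    best = {}
    for view in views.values():
        for sid, csn in view.items():
            if sid not in best or csn > best[sid]:
                best[sid] = csn
    behind = {}
    for server, view in views.items():
        stale = sorted(sid for sid, csn in best.items() if view.get(sid) != csn)
        if stale:
            behind[server] = stale
    return behind


# Hypothetical values, as they might be collected from each server with e.g.
#   ldapsearch -x -H ldap://ldap1 -s base -b "dc=example,dc=com" contextCSN
sample = {
    "ldap1": ["20091104121200.000000Z#000000#001#000000",
              "20091104121500.000000Z#000000#002#000000"],
    "ldap2": ["20091104121200.000000Z#000000#001#000000",
              "20091104121500.000000Z#000000#002#000000"],
    "ldap3": ["20091104121200.000000Z#000000#001#000000",
              "20091104120000.000000Z#000000#002#000000"],
}

print(lagging_servers(sample))  # {'ldap3': [2]}
```

If all servers report the same contextCSN set a short while after a write, replication has caught up; a server that stays behind for one serverID is the place to start reading the sync log.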
--On Wednesday, November 04, 2009 1:12 PM +0100 "Torsten Schlabach (Tascel eG)" tschlabach@tascel.net wrote:
Hi all!
I am currently trying to chase some problems in an n-way multi-master setup with three servers. We have used the instructions at
http://www.openldap.org/doc/admin24/replication.html#N-Way%20Multi-Master
as our guidance and we are using OpenLDAP version 2.4.11.
I suggest you go read the CHANGES log for what has been fixed between 2.4.11 and the latest stable 2.4.19.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc
Zimbra :: the leader in open source messaging and collaboration
Hi Quanah!
I suggest you go read the CHANGES log for what has been fixed between 2.4.11 and the latest stable 2.4.19.
I have to say, it worries me a bit that for problems with a core feature which has been around for quite some time, the answer is, more often than I would like to hear: you need to use the latest version, released last week or last month or so.
I have indeed read the CHANGES and seen that some issues have been fixed. I have no idea if we are affected by those issues or not.
Also, how would I know that *now* in 2.4.19 all problems are fixed and the answer next week won't be: You need to use 2.4.20.
But as this is a FOSS project and not a product we pay for, we understand that we should not blame people but try and help if we find a problem.
For that reason I have asked in my email for help on *understanding* and *diagnosing* problems, to have a chance to contribute in case we find any new issues.
Also, our customers may not like it if, in case of a problem, we tell them: let's wait and see whether a new release in a few weeks will fix it or not. So I'd rather be in a position to get my hands dirty myself in case of problems.
Regards, Torsten
"Torsten Schlabach (Tascel eG)" tschlabach@tascel.net writes:
[...]
The answer is quite simple: do not use multimaster replication in a production environment. In most cases the requirement for multimaster replication is just based on poor directory design. Slapd as a stand-alone directory is rock stable and outperforms all other products I know of. Slapd in a synchronized environment is, with a few exceptions which have only been fixed recently, rock stable; I know of environments with up to 150 consumers.
-Dieter
Hi Dieter!
The answer is quite simple: do not use multimaster replication in a production environment. In most cases the requirement for multimaster replication is just based on poor directory design.
If this is a "do not use" feature, why has it been included in the software in the first place?
Slapd in a synchronized environment is, with a few exceptions which have only been fixed recently, rock stable, I know of environments with up to 150 consumers.
When you say "synchronized", do you mean one master and n slaves?
When you say, the requirement for N-way multi-master is usually poor directory design, I wonder if I am suffering from a misconception here, i.e. mixing up N-way multi-master and mirror mode possibly.
What we want to achieve is a HA solution where *all* directory data is stored on more than one physical machine. I know I can do that by having a master and a slave. But then I would need to have a mechanism entirely external to slapd that if the master fails I turn the slave into a master and vice versa. (However this could be reliably achieved.)
So the idea for N-way multi-master was just: I can point the DNS entry to whatever server in my cluster (possibly there may be more than two) and it will be a writeable directory and I won't ever lose any information I write into that LDAP cloud.
Regards, Torsten
Hi Torsten!
"Torsten Schlabach (Tascel eG)" tschlabach@tascel.net writes:
Hi Dieter!
The answer is quite simple: do not use multimaster replication in a production environment. In most cases the requirement for multimaster replication is just based on poor directory design.
If this is a "do not use feature", for what reason has it been included in the software, in the first place.
Well, there is the protocol (RFC 4510), and the OpenLDAP Project aims to be the reference implementation of this protocol. On the other hand, the OpenLDAP Project is a community-driven project (http://www.openldap.org/project); that is, features which are not part of the protocol but may be of interest to the community can be included. With regard to multimaster replication, this feature has only been included since 2.3 (if I remember correctly) and has undergone heavy recoding ever since. I personally still consider multimaster replication beta and not stable enough for production use.
Slapd in a synchronized environment is, with a few exceptions which have only been fixed recently, rock stable, I know of environments with up to 150 consumers.
When you say "synchronized", do you mean one master and n slaves?
Yes
When you say, the requirement for N-way multi-master is usually poor directory design, I wonder if I am suffering from a misconception here, i.e. mixing up N-way multi-master and mirror mode possibly.
probably
What we want to achieve is a HA solution where *all* directory data is stored on more than one physical machine. I know I can do that by having a master and a slave. But then I would need to have a mechanism entirely external to slapd that if the master fails I turn the slave into a master and vice versa. (However this could be reliably achieved.)
What you describe is Mirror Mode.
So the idea for N-way multi-master was just: I can point the DNS entry to whatever server in my cluster (possibly there may be more than two) and it will be a writeable directory and I won't ever lose any information I write into that LDAP cloud.
OK, this requirement does not call for multimaster replication, only for Mirror Mode in an HA cluster of providers, with consumers chaining write operations to the active provider.
-Dieter
Hi Dieter!
Your answers are very helpful, thanks for that!
directory design, I wonder if I am suffering from a misconception here, i.e. mixing up N-way multi-master and mirror mode possibly.
probably
So looking at
http://www.openldap.org/doc/admin24/replication.html#Configuring%20the%20dif...
What I understand is:
18.3.1. Syncrepl
This is about the syncrepl engine as such. So this is applicable to all types of replication and it's the technical basics, right?
18.3.2. Delta-syncrepl
I guess this is simple master-slave, isn't it? Though I fail to understand why this is about deltas while obviously the other mechanisms aren't, are they?
18.3.3. N-Way Multi-Master
This was the first section which explained something which sounded like what I am looking for. So I went for it.
18.3.4. MirrorMode
Looking at the config example, I just can't tell the difference from 18.3.3. N-Way Multi-Master except:
- samples in this section are not cn=config based, but I guess that shouldn't matter; it's just a question of which mechanism I like to use, isn't it?
- in both N-Way Multi-Master and in Mirror Mode I have serverID, mirrormode on and syncrepl statements.
So what actually is the difference between Mirror Mode and N-Way Multi-Master except having 2 servers or three servers?
Regards, Torsten
Hi Torsten,
"Torsten Schlabach (Tascel eG)" tschlabach@tascel.net writes:
Hi Dieter!
Your answers are very helpful, thanks for that!
directory design, I wonder if I am suffering from a misconception here, i.e. mixing up N-way multi-master and mirror mode possibly.
probably
So looking at
http://www.openldap.org/doc/admin24/replication.html#Configuring%20the%20dif...
What I understand is:
18.3.1. Syncrepl
This is about the syncrepl engine as such. So this is applicable to all types of replication and it's the technical basics, right?
Yes.
18.3.2. Delta-syncrepl
I guess this is simple master-slave, isn't it? Though I fail to understand why this is about deltas while obviously the other mechanisms aren't, are they?
Well, without delta-syncrepl the complete entry is synchronised whenever it is modified. Delta-syncrepl requires a special log database; the consumer requests information on the modified entries and fetches only the modified attributes.
18.3.3. N-Way Multi-Master
This was the first section which explained something which sounded like what I am looking for. So I went for it.
18.3.4. MirrorMode
Looking at the config example, I just can't tell the difference from 18.3.3. N-Way Multi-Master except:
- samples in this section are not cn=config based, but I guess that
shouldn't matter but it's just a question of which mechanism I like to use, isn't it?
Yes
- in both N-Way Multi-Master and in Mirror Mode I have serverID,
mirrormode on and syncrepl statements.
So what actually is the difference between Mirror Mode and N-Way Multi-Master except having 2 servers or three servers?
Mirror Mode is just synchronisation between two or more providers, where only one provider is active (see the mirrormode on/off parameter) and the others are in stand-by mode; consumers connect only to the active provider. N-Way Multimaster is a network of active providers which synchronise data by means of syncrepl. In other words, a client may write to any of these providers and it is up to the providers to stay synchronised. Multimaster is not covered by the protocol.
-Dieter
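To make the delta-syncrepl point above concrete, here is a small conceptual sketch. It is not how slapd implements it; the entry contents and the helper name changed_attributes are invented for illustration. It only contrasts "send the whole modified entry" with "send only the attributes that changed".

```python
def changed_attributes(old, new):
    """Return only the attributes whose values differ between two versions of an entry."""
    delta = {}
    for attr in set(old) | set(new):
        if old.get(attr) != new.get(attr):
            # None marks an attribute that was removed from the entry.
            delta[attr] = new.get(attr)
    return delta


# Hypothetical entry before and after a single modification.
before = {"cn": ["Jane Doe"], "mail": ["jane@example.com"], "title": ["Engineer"]}
after = {"cn": ["Jane Doe"], "mail": ["jane.doe@example.com"], "title": ["Engineer"]}

full_entry = after                          # plain syncrepl: the whole entry travels
delta = changed_attributes(before, after)   # delta-syncrepl: only the change travels

print(len(full_entry), "attributes vs", len(delta), "attribute")
```

For entries with many or large attributes, the second form is what makes delta-syncrepl attractive, at the cost of maintaining the log database mentioned above.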
--On Thursday, November 05, 2009 10:06 PM +0100 "Torsten Schlabach (Tascel eG)" tschlabach@tascel.net wrote:
Hi Dieter!
The answer is quite simple: do not use multimaster replication in a production environment. In most cases the requirement for multimaster replication is just based on poor directory design.
If this is a "do not use feature", for what reason has it been included in the software, in the first place.
Slapd in a synchronized environment is, with a few exceptions which have only been fixed recently, rock stable, I know of environments with up to 150 consumers.
When you say "synchronized", do you mean one master and n slaves?
Sounds like that is what he meant.
When you say, the requirement for N-way multi-master is usually poor directory design, I wonder if I am suffering from a misconception here, i.e. mixing up N-way multi-master and mirror mode possibly.
There's a section on this in the admin guide. I think having HA masters is desirable, thus I find Mirror Mode a plus.
What we want to achieve is a HA solution where *all* directory data is stored on more than one physical machine. I know I can do that by having a master and a slave. But then I would need to have a mechanism entirely external to slapd that if the master fails I turn the slave into a master and vice versa. (However this could be reliably achieved.)
Again, a good argument for mirror mode.
So the idea for N-way multi-master was just: I can point the DNS entry to whatever server in my cluster (possibly there may be more than two) and it will be a writeable directory and I won't ever lose any information I write into that LDAP cloud.
True. But see the section in the admin guide about the issues with MMR. And, as far as the OpenLDAP Development team is concerned, MMR is a fully supported feature. However, there have been issues with it that are being fixed as they are found. It'd be best, for example, to use current RE24 CVS Head, as there's a fix to syncprov in it.
--Quanah
Torsten Schlabach (Tascel eG) wrote:
Hi Quanah!
I suggest you go read the CHANGES log for what has been fixed between 2.4.11 and the latest stable 2.4.19.
I have to say, it worries me a bit that for problems with a core feature which has been around for quite some time, the answer is, more often than I would like to hear: you need to use the latest version, released last week or last month or so.
Generally the reason for very old bugs to be fixed is because no one ever reported any problems in that area until recently. That says that not enough testing occurred to catch it earlier, or that people who ran into the bugs didn't report them earlier.
I have indeed read the CHANGES and seen that some issues have been fixed. I have no idea if we are affected by those issues or not.
Also how would I know that *now* in 2.4.19 all problems are fixed and the answer next week won't be: You need to use 2.4.20.
Subscribe to the openldap-bugs mailing list and monitor what bugs we're currently investigating and what patches we're currently testing and looking for feedback on.
But as this is a FOSS project and not a product we pay for, we understand that we should not blame people but try and help if we find a problem.
For that reason I have asked in my email for help on *understanding* and *diagnosing* problems to have a chance to contribute in case we will find any new issues.
The first rule is make sure you're working with current source. Diagnosing problems that have already been fixed just wastes everyone's time.
Next, read the documentation and the test cases in the test suite to see how things work, or how things are expected to work. That's the best way to understand the system.
When you run into problems that our test cases don't cover, write a new test case that addresses it. The bugs that get fixed the fastest are the ones that are reported with enough information to reliably reproduce them. The best situation is to include a test case that we can add to our test suite to ensure that the situation is always tested for from then on, to detect regressions.
Also our customers may not like it if in case of a problem we tell them: Let's wait if in some weeks a new release will come which will fix it or not. So I'd rather be in a position to get my hands dirty myself in case of problems.
A good realization, but generally you should have the answer to this question in hand before you get your first customer...
Hi!
Thanks for all your comments.
Also our customers may not like it if in case of a problem we tell them: Let's wait if in some weeks a new release will come which will fix it or not. So I'd rather be in a position to get my hands dirty myself in case of problems.
A good realization, but generally you should have the answer to this question in hand before you get your first customer...
Yes, indeed, and this is where I am trying to get.
For that reason I have asked this:
http://www.openldap.org/lists/openldap-software/200911/msg00040.html
Regards, Torsten
On Thu, Nov 5, 2009 at 5:25 AM, Torsten Schlabach (Tascel eG) tschlabach@tascel.net wrote:
[...]
Also how would I know that *now* in 2.4.19 all problems are fixed and the answer next week won't be: You need to use 2.4.20.
Testing reveals the presence of bugs, not the absence :) So no one can ever say version x.y.z is certified bug free.
However, I do tend to agree, in that my MM just flaked out, and there is not much load/write/update going on so I am a bit worried.
I am not trying to put down OpenLDAP, but iPlanet/Fedora Directory Server/389 support up to a 4-way MM implementation and I have found the replication rock solid even under high load. So if MM is your requirement, that may be a more valid option.
The answer is quite simple: do not use multimaster replication in a production environment. In most cases the requirement for multimaster replication is just based on poor directory design.
Dieter, I do not agree with that. You can't blame a user for using a feature. It is not marked as experimental anymore so people are going to use it. Once it fails you can't call them a "Poor Directory Designer" for using it.
Edward Capriolo edlinuxguru@gmail.com writes:
On Thu, Nov 5, 2009 at 5:25 AM, Torsten Schlabach (Tascel eG) tschlabach@tascel.net wrote:
[...]
The answer is quite simple: do not use multimaster replication in a production environment. In most cases the requirement for multimaster replication is just based on poor directory design.
Dieter, I do not agree with that. You can't blame a user for using a feature. It is not marked as experimental anymore so people are going to use it. Once it fails you can't call them a "Poor Directory Designer" for using it.
I am not blaming any user who has to implement multimaster replication; at least that has not been my intention. If you want to set up a partitioned or replicated directory, you have to take a lot of requirements into consideration. Just think about the number of connections in a multimaster environment: n * (n-1). Just name a network topology or a directory design requirement that is dependent on multimaster replication. I personally know of only one directory where multimaster with 10 providers had to be implemented. On the other hand, I do know of directories with up to 150 consumers connecting to one provider.
-Dieter
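The n * (n-1) figure above is easy to put into numbers. The short sketch below just evaluates that formula for a few provider counts and compares it with the one-connection-per-consumer case; the function names are made up for this illustration.

```python
def multimaster_connections(providers):
    """Each provider is a consumer of every other provider: n * (n - 1) connections."""
    return providers * (providers - 1)


def single_provider_connections(consumers):
    """With one provider, every consumer holds exactly one connection to it."""
    return consumers


for n in (2, 3, 10):
    print(n, "providers ->", multimaster_connections(n), "syncrepl connections")

print(150, "consumers ->", single_provider_connections(150), "connections to one provider")
```

So the three-server setup from the start of this thread needs 6 replication connections, and the 10-provider case mentioned above needs 90, while the 150-consumer deployment grows only linearly.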
Edward Capriolo wrote:
On Thu, Nov 5, 2009 at 5:25 AM, Torsten Schlabach (Tascel eG) tschlabach@tascel.net wrote:
[...]
Also how would I know that *now* in 2.4.19 all problems are fixed and the answer next week won't be: You need to use 2.4.20.
Testing reveals the presence of bugs, not the absence :) So no one can ever say version x.y.z is certified bug free.
However, I do tend to agree, in that my MM just flaked out, and there is not much load/write/update going on so I am a bit worried.
I am not trying to put down OpenLDAP, but iPlanet/Fedora Directory Server/389 supports up to a 4-way MM implementation, and I have found the replication rock solid even under high load. So if MM is your requirement, that may be a more valid option.
The historical evidence disagrees with your assertion. Even at this late date, FDS MMR still breaks irrecoverably.
https://www.redhat.com/archives/fedora-directory-users/2009-November/msg0005...
How many years have they been flogging this feature? They still haven't got it right. They can't.
MMR is inherently flawed, as we have been saying for years.
http://www.watersprings.org/pub/id/draft-zeilenga-ldup-harmful-02.txt
We have implemented it in OpenLDAP mainly for political reasons, not because we changed our minds and now believe it to be technically sound. It is not. We developed and recommend MirrorMode because the only safe way to do replication is by preserving single-master consistency.
The answer is quite simple: do not use multimaster replication in a production environment. In most cases the requirement for multimaster replication is just based on poor directory design.
Dieter, I do not agree with that. You can't blame a user for using a feature. It is not marked as experimental anymore so people are going to use it. Once it fails you can't call them a "Poor Directory Designer" for using it.
If they have implemented MMR without reading all of the warnings, they are certainly poor designers for not becoming fully informed of the topic before deploying it. If they have implemented MMR after reading all of the warnings, they made a conscious choice.
On Wed, Nov 11, 2009 at 3:01 PM, Howard Chu hyc@symas.com wrote:
Edward Capriolo wrote:
-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
I understand that OpenLDAP does not do distributed locking; as a result, I do not expect it to have ACID compliance.
Fedora Directory Server/389 has a "last update wins" policy, so this is a much more optimistic strategy, but it works (for what I was doing).
Since I joined this mailing list about a month ago, after my problems started, I have seen at least 4 other threads with similar issues.
http://www.openldap.org/lists/openldap-software/200911/msg00015.html http://www.openldap.org/lists/openldap-software/200911/msg00021.html ...
Upgrade to 2.4.19 is suggested as a resolution, and I found another thread with a bigger problem in that version.
As to the link you have posted: https://www.redhat.com/archives/fedora-directory-users/2009-November/msg0005...
It is very easy to quickly search a mailing list and find some people having problems with software. That does not prove FDS has many MM problems. I personally ran a two-node FDS instance with very active WRITE/UPDATE for two years and had only a few isolated problems.
If they have implemented MMR without reading all of the warnings, they are certainly poor designers for not becoming fully informed of the topic before deploying it.
From my perspective, I find the reliability of M-M OpenLDAP on 2.4.16 brittle. I am not the only one having problems. Your comment seems to suggest I did not read enough. I would upgrade to 2.4.19, but someone else on this list is having problems with that version, so that does not seem like a safe option.
Since I have installed OpenLDAP on two lightly trafficked nodes:
1) One node locked up
2) After lockup/restart the nodes did not re-establish the two-way replication connection
3) I have out-of-sync data (which I do not believe was added during the downtime caused by 1)
Linking to an RFC and implying that I "Don't read enough" is wrong. If my light usage is bringing to light obvious bugs and I am not the only one having these issues, not enough testing on the software development side is being done.
As an administrator I ran 'make test' and watched test050-syncrepl-multimaster complete. That, coupled with the fact that multi-master is no longer labeled as an "experimental" feature, led me to believe it worked reasonably well.
The RFC makes no mention of my #2 problem, 'After lockup/restart the nodes did not re-establish two way replication connection'. Is that supposed to be the fault of the user? This is obviously a bug or an edge case. This is not the fault of a user not reading enough. Which is where the frustration is, I think. People are willing to accept the failure cases covered in the RFC, but the RFC is not a blanket statement "We told you not to run this" for every bug that appears.
Edward Capriolo wrote:
As to the link you have posted: https://www.redhat.com/archives/fedora-directory-users/2009-November/msg0005...
It is very easy to quickly search a mailing list and find some people having problems with software. That does not prove FDS has many MM problems. I personally ran a two-node FDS instance with very active WRITE/UPDATE for two years and had only a few isolated problems.
Sure, it's easy to find problems in anything when you search back far enough into the past. The fact that it's so easy to find problems so *recently* is what I was pointing out. And your personal anecdote doesn't change the fact that it is broken.
Since I have installed openldap on two lightly traffic nodes:
1) One node locked up
2) After lockup/restart the nodes did not re-establish two way replication connection
3) I have out of sync data (which I do not believe was added during the downtime caused by 1)
Linking to an RFC and implying that I "Don't read enough" is wrong. If my light usage is bringing to light obvious bugs and I am not the only one having these issues, not enough testing on the software development side is being done.
Fair enough, I agree. The fact that these problems are showing up so quickly in the field means that we haven't done enough testing prior to release. If you can condense your problem scenarios into something we can easily reproduce and subsequently include in the test suite, then we can attack these problems and move forward.
Torsten Schlabach (Tascel eG) wrote:
To begin with, I would just ask for confirmation of my proper understanding of the documentation:
- A master server is a server which is using the syncprov overlay (servers/slapd/overlays/syncprov.c). This overlay will do little more than just provide a synchronization cookie (CSN) which consumers may ask for to find out what needs to get replicated and what not.
Pretty much, yes. We use the term "provider", not "master".
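To make the cookie idea a bit more concrete, here is a minimal sketch of comparing CSNs across servers. It assumes the CSN layout used by slapd 2.4 (timestamp#count#sid#mod, with the last three fields in hex) and assumes you have already collected the contextCSN values of the suffix entry from each server yourself. The function names and host names are made up for illustration; nothing here is part of OpenLDAP.

```python
from typing import Dict, Iterable, List, NamedTuple


class CSN(NamedTuple):
    timestamp: str  # e.g. "20091105102500.000000Z" (fixed width, sorts as text)
    count: int      # change counter within the same timestamp
    sid: int        # server ID of the provider that made the change
    mod: int        # modification counter within one operation


def parse_csn(value: str) -> CSN:
    """Split a CSN of the assumed form 'time#count#sid#mod'."""
    timestamp, count, sid, mod = value.strip().split("#")
    return CSN(timestamp, int(count, 16), int(sid, 16), int(mod, 16))


def newest_per_sid(values: Iterable[str]) -> Dict[int, CSN]:
    """Keep only the newest CSN seen for each server ID."""
    newest: Dict[int, CSN] = {}
    for csn in map(parse_csn, values):
        if csn.sid not in newest or csn > newest[csn.sid]:
            newest[csn.sid] = csn
    return newest


def lagging_servers(context_csns: Dict[str, Iterable[str]]) -> Dict[str, List[int]]:
    """For each server, list the SIDs for which some other server is ahead."""
    per_server = {host: newest_per_sid(v) for host, v in context_csns.items()}
    best: Dict[int, CSN] = {}
    for csns in per_server.values():
        for sid, csn in csns.items():
            if sid not in best or csn > best[sid]:
                best[sid] = csn
    lagging: Dict[str, List[int]] = {}
    for host, csns in per_server.items():
        behind = sorted(sid for sid, csn in best.items() if csns.get(sid) != csn)
        if behind:
            lagging[host] = behind
    return lagging
```

An empty result means every server holds the same newest CSN for every SID. A server that shows up in the result has not yet received changes originating from the listed SIDs; whether that is a short delay or a stuck consumer is something only repeated checks can tell.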
- A consumer server is a server in which an additional thread is running which will query the master(s) in a given interval to ask for updates and, if there are any, get them over the wire and into the local copy of the database. This synchronization thread is servers/slapd/syncrepl.c I guess?
There is no additional thread. In refreshOnly mode it will periodically ask for new updates; in refreshAndPersist mode it will keep an LDAP session open and receive updates as they occur.
- An N-Way Multi-Master setup is a setup in which N servers are each a master and any of the others is a consumer of all other masters?
N servers are each a provider; any of them can be consumers of any number of the other providers. Full NxN connectivity is not a requirement, nor does it scale well.
So what I fail to understand is:
- What is the difference between Mirror Mode and N-Way Multi-Master? Especially given that in N-Way Multi-Master, you have to set olcMirrorMode to TRUE.
MirrorMode relies on an external frontend to direct all updates to a single provider. There is no difference within the OpenLDAP code between MirrorMode and MultiMaster; the difference is entirely external, based on your deployment.
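As a rough illustration of what "external frontend" means here, the following toy model routes every write to one active provider and spreads reads over all of them. It is only a sketch of the deployment idea: the class and method names are invented, and a real frontend (a load balancer, a proxy, or client-side logic) would also need health checks, which are not modeled.

```python
from typing import List


class MirrorModeFrontend:
    """Toy model of the external frontend: all writes go to one provider."""

    def __init__(self, providers: List[str]) -> None:
        if not providers:
            raise ValueError("need at least one provider")
        self.providers = list(providers)
        self.active = 0       # index of the provider currently accepting writes
        self._next_read = 0   # round-robin position for reads

    def write_target(self) -> str:
        """Every update is sent to the single active provider."""
        return self.providers[self.active]

    def read_target(self) -> str:
        """Reads may go to any provider; rotate through them."""
        host = self.providers[self._next_read % len(self.providers)]
        self._next_read += 1
        return host

    def fail_over(self) -> str:
        """Move writes to the next provider, e.g. after the active one fails."""
        self.active = (self.active + 1) % len(self.providers)
        return self.write_target()
```

The point of the design is that at any moment exactly one provider receives updates, so single-master consistency is preserved even though every node is configured identically.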
- Given that I have added a 'Sync' value to the olcLogLevel attribute, what would be the "health check" information I should be watching for in the log to see that replication is attempted as expected?
- What problems should I be watching for in the logs?
Error messages related to connect failures, retries, etc.
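One way to watch for such messages is a small log filter. The sketch below is only an assumption about what to look for: it treats any line that mentions "syncrepl" together with a word like "retrying" or "failed" as suspicious and groups the hits by rid. The keyword list is a guess, not taken from the slapd sources, so adjust it to the messages your own servers actually log.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: replication messages mention "syncrepl" somewhere in the line,
# and trouble shows up as one of these words. Adjust both to your own logs.
SYNC_MARKER = re.compile(r"syncrepl", re.IGNORECASE)
TROUBLE = re.compile(r"\b(retrying|failed|error|refused|timed out)\b", re.IGNORECASE)
RID = re.compile(r"rid=(\d+)")


def scan_sync_log(lines: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Count suspicious replication lines per rid and collect the lines."""
    per_rid: Counter = Counter()
    hits: List[str] = []
    for line in lines:
        if SYNC_MARKER.search(line) and TROUBLE.search(line):
            match = RID.search(line)
            per_rid[match.group(1) if match else "unknown"] += 1
            hits.append(line.rstrip("\n"))
    return per_rid, hits
```

It can be fed an open log file directly, since a file object iterates over its lines. A rid whose count keeps growing between runs points at a consumer that cannot reach or bind to its provider.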
- Could I for example manually ask a master (using some ldapsearch statement, pretending I was the consumer) which entries the master thinks I would have to update?
Yes, use ldapsearch -E sync. See the ldapsearch(1) manpage.
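For reference, a small helper that assembles such a command line might look like the sketch below. It relies on the sync=ro[/cookie] and sync=rp[/cookie] forms described in ldapsearch(1); the URI, the base DN and the use of an anonymous simple bind (-x) are assumptions for illustration, and the cookie is passed through verbatim because its exact contents depend on your server.

```python
from typing import List, Optional


def sync_search_argv(uri: str, base: str, mode: str = "ro",
                     cookie: Optional[str] = None) -> List[str]:
    """Build an ldapsearch command line that uses the sync search extension."""
    if mode not in ("ro", "rp"):
        raise ValueError("mode must be 'ro' (refreshOnly) or 'rp' (refreshAndPersist)")
    ext = "sync=" + mode
    if cookie:
        # Cookie as previously handed out by the provider, used unchanged.
        ext += "/" + cookie
    # -x: simple bind, -LLL: plain LDIF output,
    # "*" and "+": request user and operational attributes.
    return ["ldapsearch", "-x", "-LLL", "-H", uri, "-b", base, "-E", ext, "*", "+"]
```

The returned list can be handed to subprocess.run() as is. Without a cookie the provider treats you as a fresh consumer and returns everything; with a cookie it returns only what it believes has changed since then.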
Hi Howard!
MirrorMode relies on an external frontend to direct all updates to a single provider. There is no difference within the OpenLDAP code between MirrorMode and MultiMaster; the difference is entirely external, based on your deployment.
So when you say: "We recommend not to use MMR", you're not saying don't use code section X of openldap but use code section Y, but you're basically saying: Do not send different updates to different providers at the same time, right?
In other words: If I would configure MMR between say 3 nodes but I would make sure that only one of them will receive updates at a time while on the other two there will only be read operations, then I have MirrorMode?
Is that right?
In refreshOnly mode it will periodically ask for new updates; in refreshAndPersist mode it will keep an LDAP session open and receive updates as they occur.
Could I query a provider to show me the current open sessions held by the consumers?
When is refreshOnly and when is refreshAndPersist recommended?
I was always confused by the documentation, as it says syncrepl is entirely a client-side technology and then it says you have the option to either pull or push updates. Now this becomes a bit clearer to me. Thanks for that.
Regards, Torsten
Torsten Schlabach (Tascel eG) wrote:
Hi Howard!
MirrorMode relies on an external frontend to direct all updates to a single provider. There is no difference within the OpenLDAP code between MirrorMode and MultiMaster; the difference is entirely external, based on your deployment.
So when you say: "We recommend not to use MMR", you're not saying don't use code section X of openldap but use code section Y, but you're basically saying: Do not send different updates to different providers at the same time, right?
Yes.
In other words: If I would configure MMR between say 3 nodes but I would make sure that only one of them will receive updates at a time while on the other two there will only be read operations, then I have MirrorMode?
Is that right?
Yes.
In refreshOnly mode it will periodically ask for new updates; in refreshAndPersist mode it will keep an LDAP session open and receive updates as they occur.
Could I query a provider to show me the current open sessions held by the consumers?
You can get a look at open sessions from cn=monitor but I don't think the syncrepl sessions are highlighted in any particular way. What are you expecting this to tell you?
When is refreshOnly and when is refreshAndPersist recommended?
We don't make any particular recommendation. Some sites want "instantaneous" replication; for them refreshAndPersist is the obvious choice. Other sites only need coarse synchronization, or don't want to have long-lived LDAP sessions open all the time. So they choose refreshOnly. On my G1 phone my slapd is configured in refreshOnly syncing my home addressbook (with a 12 hour refresh interval) because I just don't make changes often enough for it to matter.
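In configuration terms the choice comes down to the type= parameter of the consumer's syncrepl statement, plus an interval= for refreshOnly. The sketch below builds the value one might put into an olcSyncrepl attribute for either mode. The parameter names are the ones documented in slapd.conf(5) and slapd-config(5); the helper itself, the default retry policy and the example values are assumptions for illustration.

```python
def syncrepl_value(rid: int, provider: str, base: str, binddn: str,
                   credentials: str, persist: bool,
                   interval: str = "00:12:00:00",
                   retry: str = "60 +") -> str:
    """Build an olcSyncrepl value for refreshOnly or refreshAndPersist."""
    parts = [
        f"rid={rid:03d}",
        f"provider={provider}",
        "bindmethod=simple",
        f'binddn="{binddn}"',
        f"credentials={credentials}",
        f'searchbase="{base}"',
        f'retry="{retry}"',  # "60 +" means: retry every 60 seconds, indefinitely
    ]
    if persist:
        # Keep the LDAP session open and receive changes as they occur.
        parts.append("type=refreshAndPersist")
    else:
        # Poll periodically; interval is written as dd:hh:mm:ss.
        parts.append("type=refreshOnly")
        parts.append(f"interval={interval}")
    return " ".join(parts)
```

With persist=False and the default interval this yields a consumer that polls every 12 hours, as in the addressbook example above; with persist=True the interval is omitted because the session stays open.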
I was always confused by the documentation, as it says syncrepl is entirely a client-side technology and then it says you have the option to either pull or push updates. Now this becomes a bit clearer to me. Thanks for that.
openldap-software@openldap.org