multi-master syncrepl issue


Chris Card

6 Aug 2012

4:53 a.m.

Hi All,

I have a multi-master openldap setup with 2 machines replicating a directory containing about 3.5 million entries.

I'm running openldap 2.4.31 on centos 6, and the directory is using the BDB backend.

Although the 2 machines are configured for multi-master syncrepl replication, in practice data is only written to one of the machines (I'll call it the master), and the second machine (which I'll call the slave) only gets data written by openldap replication.

Currently the contextCSN of the directory is the same from both machines, which (as I understand it) should mean that the directories are in sync, but I have written a program to compare what is in both directories which finds that there are 16 entries in the master directory not in the slave directory. I have double checked this using ldapsearch on both directories.

I can't see any error messages in the openldap log and there doesn't appear to be any pattern connecting the entries which are missing from the slave. Most of the missing entries were in the master directory before I created the slave machine and configured replication and have not changed.

The syncrepl config looks like this:

dn: olcDatabase={1}bdb,cn=config
olcSyncrepl: {0}rid=101 provider="ldap://<master>:389" binddn="<binddn>" bindmethod=simple credentials=<bindpw> searchbase="<prefix>" type=refreshAndPersist retry="5 5 300 5" timeout=1
olcSyncrepl: {1}rid=110 provider="ldap://<slave>:389" binddn="<binddn>" bindmethod=simple credentials=<bindpw> searchbase="<prefix>" type=refreshAndPersist retry="5 5 300 5" timeout=1

Are there any known issues with openldap replication which could result in missing data?

How can I force these missing entries to appear in the slave without rebuilding the whole of the slave directory and without changing the data in the master directory?

Chris
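The comparison described in this message (entries present on the master but absent from the slave) could look roughly like the sketch below. It is not the poster's program. It assumes each server's DNs have been dumped to LDIF text beforehand, for example with an ldapsearch that requests no attributes (`1.1`), and the function names `dns_from_ldif` and `missing_on_replica` are hypothetical. Lowercasing the DN is only a crude stand-in for proper DN normalization.

```python
import base64


def dns_from_ldif(ldif_text):
    """Collect DNs from LDIF text into a set (crudely normalized to lowercase)."""
    # Unfold LDIF continuation lines: a newline followed by one space
    # continues the previous line.
    unfolded = ldif_text.replace("\r\n", "\n").replace("\n ", "")
    dns = set()
    for line in unfolded.split("\n"):
        if line.startswith("dn:: "):
            # Base64-encoded DN (used when the DN is not plain ASCII).
            dn = base64.b64decode(line[5:]).decode("utf-8")
        elif line.startswith("dn: "):
            dn = line[4:]
        else:
            continue
        dns.add(dn.strip().lower())
    return dns


def missing_on_replica(master_ldif, replica_ldif):
    """Return the sorted DNs that appear in the master dump but not the replica dump."""
    return sorted(dns_from_ldif(master_ldif) - dns_from_ldif(replica_ldif))
```

A set difference only detects missing entries. It says nothing about entries that exist on both sides with different attribute values, which would need a per-entry comparison.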


Chris Card

21 Aug 2012

6:10 a.m.

Hi All,

In addition to the questions below (which I'd still like to see an answer to), I have a further related question.

I now have 4 machines set up for multi-master replication of this directory, and the contextCSNs were all in sync last week. One of the machines was taken down for maintenance for a couple of days and has only just come back up, so its contextCSN is way behind (20120818093702.01462Z#000000#001#000000 compared to 20120821130410.679339Z#000000#001#000000 as of now). This machine has now been up and running for a few hours, but there's no sign of replication catching up. Should ldap replication sort itself out automatically and bring this machine's directory in line? Or do I need to do something manually to force it?
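To put a number on "way behind", the two contextCSN values quoted above can be split into their fields and the timestamps subtracted. This is a minimal sketch, assuming the `<time>Z#<count>#<sid>#<mod>` layout with hexadecimal count, sid and mod fields. The function name `parse_csn` is hypothetical, and the five-digit fraction in the first CSN as posted is treated here as a decimal fraction of a second, which is an assumption.

```python
from datetime import datetime, timedelta, timezone


def parse_csn(csn):
    """Split a CSN string into timestamp, change count, server ID and mod fields."""
    stamp, count, sid, mod = csn.split("#")
    whole, _, frac = stamp.rstrip("Z").partition(".")
    when = datetime.strptime(whole, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    if frac:
        # Pad or truncate the fraction to microseconds (assumed decimal fraction).
        when += timedelta(microseconds=int(frac.ljust(6, "0")[:6]))
    return {
        "time": when,
        "count": int(count, 16),
        "sid": int(sid, 16),
        "mod": int(mod, 16),
    }


stale = parse_csn("20120818093702.01462Z#000000#001#000000")
current = parse_csn("20120821130410.679339Z#000000#001#000000")
lag = current["time"] - stale["time"]
print(lag)
```

Both values carry server ID 001, so they describe the same writer's changes as seen from two servers. With these inputs the gap is a little over three days and three hours.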

Chris

From: ctcard@hotmail.com
To: openldap-technical@openldap.org
Subject: multi-master syncrepl issue
Date: Mon, 6 Aug 2012 11:53:53 +0000

Hi All,

I have a multi-master openldap setup with 2 machines replicating a directory containing about 3.5 million entries.

I'm running openldap 2.4.31 on centos 6, and the directory is using the BDB backend.

Although the 2 machines are configured for multi-master syncrepl replication, in practice data is only written to one of the machines (I'll call it the master), and the second machine (which I'll call the slave) only gets data written by openldap replication.

Currently the contextCSN of the directory is the same from both machines, which (as I understand it) should mean that the directories are in sync, but I have written a program to compare what is in both directories which finds that there are 16 entries in the master directory not in the slave directory. I have double checked this using ldapsearch on both directories.

I can't see any error messages in the openldap log and there doesn't appear to be any pattern connecting the entries which are missing from the slave. Most of the missing entries were in the master directory before I created the slave machine and configured replication and have not changed.

The syncrepl config looks like this:

dn: olcDatabase={1}bdb,cn=config
olcSyncrepl: {0}rid=101 provider="ldap://<master>:389" binddn="<binddn>" bindmethod=simple credentials=<bindpw> searchbase="<prefix>" type=refreshAndPersist retry="5 5 300 5" timeout=1
olcSyncrepl: {1}rid=110 provider="ldap://<slave>:389" binddn="<binddn>" bindmethod=simple credentials=<bindpw> searchbase="<prefix>" type=refreshAndPersist retry="5 5 300 5" timeout=1

Are there any known issues with openldap replication which could result in missing data?

How can I force these missing entries to appear in the slave without rebuilding the whole of the slave directory and without changing the data in the master directory?

Chris

Quanah Gibson-Mount

7:34 a.m.

--On Tuesday, August 21, 2012 1:10 PM +0000 Chris Card ctcard@hotmail.com wrote:

...

Hi All,

In addition to the questions below (which I'd still like to see an answer to), I have a further related question.

I now have 4 machines set up for multi-master replication of this directory, and the contextCSNs were all in sync last week. One of the machines was taken down for maintenance for a couple of days and has only just come back up, so its contextCSN is way behind (20120818093702.01462Z#000000#001#000000 compared to 20120821130410.679339Z#000000#001#000000 as of now). This machine has now been up and running for a few hours, but there's no sign of replication catching up. Should ldap replication sort itself out automatically and bring this machine's directory in line? Or do I need to do something manually to force it?

Do you have sync logging enabled?

In any case, with a server that large (3.5 million entries), I expect it will generally take a few months for it to catch up, since with standard syncrepl, it is going to try and refresh the entire database if it passed your sync ops log. I would strongly advise delta-syncrepl MMR instead, although that may take a significant amount of disk space for such a large DB if it is heavily active.
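For readers unfamiliar with what the delta-syncrepl suggestion changes on the consumer side, the sketch below renders an olcSyncrepl value that carries the three extra parameters delta-syncrepl uses: `syncdata=accesslog`, `logbase` and `logfilter`. It reuses the placeholders from the config posted earlier in the thread. The helper name `delta_syncrepl_value` and the `cn=accesslog` suffix are assumptions, and the provider-side accesslog database and overlay that delta-syncrepl also requires are not shown.

```python
def delta_syncrepl_value(rid, provider, binddn, credentials, searchbase,
                         logbase="cn=accesslog"):
    """Render one olcSyncrepl value configured for delta-syncrepl (consumer side only)."""
    parts = [
        f"rid={rid:03d}",
        f'provider="{provider}"',
        f'binddn="{binddn}"',
        "bindmethod=simple",
        f"credentials={credentials}",
        f'searchbase="{searchbase}"',
        "type=refreshAndPersist",
        'retry="5 5 300 5"',
        # The next three parameters switch the consumer to reading the
        # provider's access log instead of comparing whole entries.
        "syncdata=accesslog",
        f'logbase="{logbase}"',
        'logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"',
    ]
    return " ".join(parts)


print("olcSyncrepl: {0}" + delta_syncrepl_value(
    101, "ldap://<master>:389", "<binddn>", "<bindpw>", "<prefix>"))
```

In a multi-master layout every server is both provider and consumer, so each one would need its own access log as well as one such value per peer.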

--Quanah

Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration

Chris Card

7:50 a.m.

...

--On Tuesday, August 21, 2012 1:10 PM +0000 Chris Card ctcard@hotmail.com wrote:

...
Hi All,

In addition to the questions below (which I'd still like to see an answer to), I have a further related question.

I now have 4 machines set up for multi-master replication of this directory, and the contextCSNs were all in sync last week. One of the machines was taken down for maintenance for a couple of days and has only just come back up, so its contextCSN is way behind (20120818093702.01462Z#000000#001#000000 compared to 20120821130410.679339Z#000000#001#000000 as of now). This machine has now been up and running for a few hours, but there's no sign of replication catching up. Should ldap replication sort itself out automatically and bring this machine's directory in line? Or do I need to do something manually to force it?

Do you have sync logging enabled?

Log level is set to none, so the slapd log doesn't give much help.

...

In any case, with a server that large (3.5 million entries), I expect it will generally take a few months for it to catch up, since with standard syncrepl, it is going to try and refresh the entire database if it passed your sync ops log. I would strongly advise delta-syncrepl MMR instead, although that may take a significant amount of disk space for such a large DB if it is heavily active.

The initial load of the directory via replication took < 2 days, so why would it take months to catch up? I tried restarting slapd with the -c flag specifying the current csn of the database, but it didn't seem to make any difference - is that expected? I'll take a look at using delta-syncrepl, since disk space isn't an issue.

Chris

Quanah Gibson-Mount

7:56 a.m.

--On Tuesday, August 21, 2012 2:50 PM +0000 Chris Card ctcard@hotmail.com wrote:

...

...
Do you have sync logging enabled?

Log level is set to none, so the slapd log doesn't give much help.

Fix your log level.

...

...
In any case, with a server that large (3.5 million entries), I expect it will generally take a few months for it to catch up, since with standard syncrepl, it is going to try and refresh the entire database if it passed your sync ops log. I would strongly advise delta-syncrepl MMR instead, although that may take a significant amount of disk space for such a large DB if it is heavily active.

The initial load of the directory via replication took < 2 days, so why would it take months to catch up? I tried restarting slapd with the -c flag specifying the current csn of the database, but it didn't seem to make any difference - is that expected? I'll take a look at using delta-syncrepl, since disk space isn't an issue.

Syncrepl is slow, and your database will have constant changes even while replication is ongoing. So if in 2 days it has caught up to where the other servers were two days ago, it will have to go back through and restart comparisons. So it may take well under a month; the point is that it will take a significant amount of time. In addition, it is going to have to take changes from 4 servers, which may cause deadlocks and other issues in the DB, assuming you're using BDB as the backend.

--Quanah


Chris Card

8:05 a.m.

----------------------------------------

...

Date: Tue, 21 Aug 2012 07:56:20 -0700
From: quanah@zimbra.com
To: ctcard@hotmail.com; openldap-technical@openldap.org
Subject: RE: multi-master syncrepl issue

--On Tuesday, August 21, 2012 2:50 PM +0000 Chris Card ctcard@hotmail.com wrote:

...
...
Do you have sync logging enabled?

Log level is set to none, so the slapd log doesn't give much help.

Fix your log level.

olcLogLevel: sync ?

...

...
...
In any case, with a server that large (3.5 million entries), I expect it will generally take a few months for it to catch up, since with standard syncrepl, it is going to try and refresh the entire database if it passed your sync ops log. I would strongly advise delta-syncrepl MMR instead, although that may take a significant amount of disk space for such a large DB if it is heavily active.

The initial load of the directory via replication took < 2 days, so why would it take months to catch up? I tried restarting slapd with the -c flag specifying the current csn of the database, but it didn't seem to make any difference - is that expected? I'll take a look at using delta-syncrepl, since disk space isn't an issue.

Syncrepl is slow, and your database will have constant changes even while replication is ongoing. So if in 2 days it has caught up to where the other servers were two days ago, it will have to go back through and restart comparisons. So it may take well under a month; the point is that it will take a significant amount of time.

Would you expect catch-up with delta-syncrepl to be much faster?

...

In addition, it is going to have to take changes from 4 servers, which may cause deadlocks and other issues in the DB, assuming you're using BDB as the backend.

There are 4 servers, but all the changes go to one of them, the others are there as read-only servers, so I don't expect problems with deadlocking. I am also looking into mdb, but that's an issue for another thread.

Quanah Gibson-Mount

8:18 a.m.

--On Tuesday, August 21, 2012 3:05 PM +0000 Chris Card ctcard@hotmail.com wrote:

...

...
Date: Tue, 21 Aug 2012 07:56:20 -0700
From: quanah@zimbra.com
To: ctcard@hotmail.com; openldap-technical@openldap.org
Subject: RE: multi-master syncrepl issue

--On Tuesday, August 21, 2012 2:50 PM +0000 Chris Card ctcard@hotmail.com wrote:

...
...
Do you have sync logging enabled?

Log level is set to none, so the slapd log doesn't give much help.

Fix your log level.

olcLogLevel: sync ?

Yes
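As a rough illustration of the change confirmed here, the sketch below builds the LDIF that would replace olcLogLevel on the cn=config entry. The helper name `loglevel_ldif` is hypothetical, and how the LDIF is applied (for instance by feeding it to ldapmodify over whatever connection and identity has write access to cn=config) depends on the local setup and is not shown.

```python
def loglevel_ldif(levels=("sync",)):
    """Build an LDIF modify operation that replaces olcLogLevel on cn=config."""
    return "\n".join([
        "dn: cn=config",
        "changetype: modify",
        "replace: olcLogLevel",
        "olcLogLevel: " + " ".join(levels),
        "",  # trailing newline terminates the record
    ])


print(loglevel_ldif())
```

Passing several keywords, such as `("stats", "sync")`, would keep ordinary operation logging alongside the syncrepl messages.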

--Quanah


Brett Maxfield

4:18 p.m.

On 21/08/2012, at 9:50 PM, Chris Card ctcard@hotmail.com wrote:

...

...
Do you have sync logging enabled?

Log level is set to none, so the slapd log doesn't give much help.

FYI: sync logging, aka the changelog, is not syslog logging; it's different. The sync log tracks changes to your LDAP data in an online DB backend, so LDAP replicas can scan it for incremental changes rather than reading every entry to see whether it has changed.

Cheers Brett

Quanah Gibson-Mount

4:38 p.m.

--On Wednesday, August 22, 2012 6:18 AM +0700 Brett Maxfield brett.maxfield@gmail.com wrote:

...

On 21/08/2012, at 9:50 PM, Chris Card ctcard@hotmail.com wrote:

...
...
Do you have sync logging enabled?

Log level is set to none, so the slapd log doesn't give much help.

FYI: sync logging, aka the changelog, is not syslog logging; it's different. The sync log tracks changes to your LDAP data in an online DB backend, so LDAP replicas can scan it for incremental changes rather than reading every entry to see whether it has changed.

Wrong.

The "sync" loglevel prints out to syslog what the master(s) and replica(s) are doing in relation to syncrepl.

--Quanah

