Delta-syncrepl and latency

List overview All Threads
Download

newer

older

help with meta backend

ppolicy and admins

Sam Tran

9 May 2007 9 May '07

2:50 a.m.

Hi All,

I am testing delta-syncrepl using OpenLDAP 2.3.35 (+ fixes ITS#4924 and fixes ITS#4925) and Berkeley DB 4.5.20 on Centos 5.

I have a provider with the following configuration:

#------------------------------------------------- # Accesslog DB definition (slapo-accesslog) #------------------------------------------------- # Access logging overlay

database hdb suffix "cn=accesslog" rootdn "cn=root,cn=accesslog" directory /var/lib/ldap/accesslog index default eq index entryUUID,entryCSN,objectClass,reqEnd,reqResult,reqStart

# Let the replica DN have limitless searches limits dn.exact="cn=syncrepl,ou=Accounts,ou=Apps,dc=example,dc=com time.soft=unlimited time.hard=unlimited size.soft=unlimited size.hard=unlimited

#------------------------------------------------- # End of Accesslog DB definition (slapo-accesslog) #-------------------------------------------------

# Syncrepl overlay - Provider configuration for suffix "cn=accesslog" overlay syncprov syncprov-nopresent TRUE syncprov-reloadhint TRUE

#------------------------------------------------- # Primary database definition #-------------------------------------------------

database hdb suffix "dc=example,dc=com" rootdn "cn=Manager,dc=example,dc=com" rootpw secret directory /var/lib/ldap/example.com cachesize 5000 idlcachesize 2000 checkpoint 512 10

index objectClass,uidNumber,gidNumber,memberUid,employeeNumber eq,pres index employeeType,accountActive,ftpActive,mailActive,vacationActive,ou,mailRoutingAddress eq index cn,mail,surname,givenname eq,pres,subinitial index displayName,gecos,telephoneNumber sub,subany index uid,aliasUid eq,sub,subany index entryUUID,entryCSN eq

# Let the replica DN have limitless searches limits dn.exact="cn=syncrepl,ou=Accounts,ou=Apps,dc=example,dc=com" time.soft=unlimited time.hard=unlimited size.soft=unlimited size.hard=unlimited #------------------------------------------------- # End of primary database definition #-------------------------------------------------

# Syncrepl - Provider configuration overlay syncprov syncprov-checkpoint 100 30 syncprov-sessionlog 100

# Access Logging overlay overlay accesslog logdb cn=accesslog logops writes logsuccess TRUE logpurge 28+00:00 01+00:00

Here is the configuration of the consumer:

############################################################# # Syncrepl - Consumer configuration ############################################################# syncrepl rid=006 provider=ldaps://provider.example.com:636 bindmethod=simple binddn="cn=syncrepl,ou=Accounts,ou=Apps,dc=example,dc=com" credentials=secret type=refreshAndPersist retry="5 +" searchbase="dc=example,dc=com" logbase="cn=accesslog" logfilter="(&(objectClass=auditWriteObject)(reqResult=0))" schemachecking=on syncdata=accesslog

I've been running some LDAP writes (batches of 10 to 10000 sequential writes) on the provider to test the replication on the consumer. Sometimes I observed some latency up to 35 sec. on the consumer with the following messages in the logs: May 8 16:52:08 info-ldap-006 slapd[24480]: do_syncrep2: rid 006 Can't contact LDAP server May 8 16:52:08 info-ldap-006 slapd[24480]: do_syncrepl: rid 006 retrying

Some other times, there was not replication at all until I restarted the consumer.

Am I missing something in the configuration? Please could you shed some light on the matter?

Thanks in advance. Sam

Show replies by date

Quanah Gibson-Mount

9 May 9 May

3:23 a.m.

--On Tuesday, May 08, 2007 8:50 PM -0400 Sam Tran stlist@gmail.com wrote:

...

Hi All,

I am testing delta-syncrepl using OpenLDAP 2.3.35 (+ fixes ITS#4924 and fixes ITS#4925) and Berkeley DB 4.5.20 on Centos 5.

I have a provider with the following configuration:

# ------------------------------------------------- # Accesslog DB definition (slapo-accesslog) # ------------------------------------------------- # Access logging overlay

database hdb suffix "cn=accesslog" rootdn "cn=root,cn=accesslog" directory /var/lib/ldap/accesslog index default eq index entryUUID,entryCSN,objectClass,reqEnd,reqResult,reqStart

# Let the replica DN have limitless searches limits dn.exact="cn=syncrepl,ou=Accounts,ou=Apps,dc=example,dc=com time.soft=unlimited time.hard=unlimited size.soft=unlimited size.hard=unlimited

# ------------------------------------------------- # End of Accesslog DB definition (slapo-accesslog) # -------------------------------------------------

# Syncrepl overlay - Provider configuration for suffix "cn=accesslog" overlay syncprov syncprov-nopresent TRUE syncprov-reloadhint TRUE

# ------------------------------------------------- # Primary database definition # -------------------------------------------------

database hdb suffix "dc=example,dc=com" rootdn "cn=Manager,dc=example,dc=com" rootpw secret directory /var/lib/ldap/example.com cachesize 5000 idlcachesize 2000 checkpoint 512 10

index objectClass,uidNumber,gidNumber,memberUid,employeeNumber eq,pres index employeeType,accountActive,ftpActive,mailActive,vacationActive,ou,mailRou tingAddress eq index cn,mail,surname,givenname eq,pres,subinitial index displayName,gecos,telephoneNumber sub,subany index uid,aliasUid eq,sub,subany index entryUUID,entryCSN eq

# Let the replica DN have limitless searches limits dn.exact="cn=syncrepl,ou=Accounts,ou=Apps,dc=example,dc=com" time.soft=unlimited time.hard=unlimited size.soft=unlimited size.hard=unlimited # ------------------------------------------------- # End of primary database definition # -------------------------------------------------

# Syncrepl - Provider configuration overlay syncprov syncprov-checkpoint 100 30 syncprov-sessionlog 100

# Access Logging overlay overlay accesslog logdb cn=accesslog logops writes logsuccess TRUE logpurge 28+00:00 01+00:00

Here is the configuration of the consumer:

############################################################# # Syncrepl - Consumer configuration ############################################################# syncrepl rid=006 provider=ldaps://provider.example.com:636 bindmethod=simple binddn="cn=syncrepl,ou=Accounts,ou=Apps,dc=example,dc=com" credentials=secret type=refreshAndPersist retry="5 +" searchbase="dc=example,dc=com" logbase="cn=accesslog" logfilter="(&(objectClass=auditWriteObject)(reqResult=0))" schemachecking=on syncdata=accesslog

I've been running some LDAP writes (batches of 10 to 10000 sequential writes) on the provider to test the replication on the consumer. Sometimes I observed some latency up to 35 sec. on the consumer with the following messages in the logs: May 8 16:52:08 info-ldap-006 slapd[24480]: do_syncrep2: rid 006 Can't contact LDAP server May 8 16:52:08 info-ldap-006 slapd[24480]: do_syncrepl: rid 006 retrying

Some other times, there was not replication at all until I restarted the consumer.

Am I missing something in the configuration? Please could you shed some light on the matter?

How have you tuned DB_CONFIG for the accesslog backend?

--Quanah

-- Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration

Sam Tran

6:46 p.m.

On 5/8/07, Quanah Gibson-Mount quanah@zimbra.com wrote: [snip]

...

How have you tuned DB_CONFIG for the accesslog backend?

--Quanah

Here is my DB_CONFIG for the accesslog backend:

set_cachesize 0 52428800 0 set_lg_regionmax 1048576 set_lg_max 10485760 set_lg_bsize 2097152 set_lg_dir /var/log/openldap_bdb/accesslog set_tmp_dir /tmp

Sam

Quanah Gibson-Mount

6:59 p.m.

--On May 9, 2007 12:46:18 PM -0400 Sam Tran stlist@gmail.com wrote:

...

On 5/8/07, Quanah Gibson-Mount quanah@zimbra.com wrote: [snip]

...
How have you tuned DB_CONFIG for the accesslog backend?

--Quanah

Here is my DB_CONFIG for the accesslog backend:

set_cachesize 0 52428800 0 set_lg_regionmax 1048576 set_lg_max 10485760 set_lg_bsize 2097152 set_lg_dir /var/log/openldap_bdb/accesslog set_tmp_dir /tmp

Have you looked at the output of db_stat -m to see how things are behaving?

--Quanah

-- Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration

Sam Tran

8:51 p.m.

On 5/9/07, Quanah Gibson-Mount quanah@zimbra.com wrote:

...

--On May 9, 2007 12:46:18 PM -0400 Sam Tran stlist@gmail.com wrote:

...
On 5/8/07, Quanah Gibson-Mount quanah@zimbra.com wrote: [snip]

...
How have you tuned DB_CONFIG for the accesslog backend?

--Quanah

Here is my DB_CONFIG for the accesslog backend:

set_cachesize 0 52428800 0 set_lg_regionmax 1048576 set_lg_max 10485760 set_lg_bsize 2097152 set_lg_dir /var/log/openldap_bdb/accesslog set_tmp_dir /tmp

Have you looked at the output of db_stat -m to see how things are behaving?

I ran 1,000 sequential writes on the provider. Then run the command you specified. Here is the output:

[root@info-ldap-004 accesslog]# /usr/local/BerkeleyDB.4.5/bin/db_stat -m 62MB 513KB 900B Total cache size 1 Number of caches 62MB 520KB Pool individual cache size 0 Maximum memory-mapped file size 0 Maximum open file descriptors 0 Maximum sequential buffer writes 0 Sleep after writing maximum sequential buffers 0 Requested pages mapped into the process' address space 904312 Requested pages found in the cache (99%) 16 Requested pages not found in the cache 1483 Pages created in the cache 16 Pages read into the cache 38 Pages written from the cache to the backing file 0 Clean pages forced from the cache 0 Dirty pages forced from the cache 0 Dirty pages written by trickle-sync thread 1499 Current total page count 4 Current clean page count 1495 Current dirty page count 8191 Number of hash buckets used for page location 905825 Total number of times hash chains searched for a page 2 The longest hash chain searched for a page 921930 Total number of hash chain entries checked for page 1 The number of hash bucket locks that required waiting (0%) 1 The maximum number of times any hash bucket lock was waited for (0%) 0 The number of region locks that required waiting (0%) 0 The number of buffers frozen 0 The number of buffers thawed 0 The number of frozen buffers freed 1539 The number of page allocations 0 The number of hash buckets examined during allocations 0 The maximum number of hash buckets examined for an allocation 0 The number of pages examined during allocations 0 The max number of pages examined for an allocation 0 Threads waited on page I/O Pool File: reqStart.bdb 4096 Page size 0 Requested pages mapped into the process' address space 51707 Requested pages found in the cache (99%) 2 Requested pages not found in the cache 37 Pages created in the cache 2 Pages read into the cache 3 Pages written from the cache to the backing file Pool File: entryUUID.bdb 4096 Page size 0 Requested pages mapped into the process' address space 43212 Requested pages found in the cache (99%) 2 Requested pages not found in the cache 69 Pages created in the cache 2 Pages read into the cache 4 Pages written from the cache to the backing file Pool File: objectClass.bdb 4096 Page size 0 Requested pages mapped into the process' address space 382439 Requested pages found in the cache (99%) 2 Requested pages not found in the cache 114 Pages created in the cache 2 Pages read into the cache 4 Pages written from the cache to the backing file Pool File: reqEnd.bdb 4096 Page size 0 Requested pages mapped into the process' address space 51688 Requested pages found in the cache (99%) 2 Requested pages not found in the cache 37 Pages created in the cache 2 Pages read into the cache 3 Pages written from the cache to the backing file Pool File: id2entry.bdb 16384 Page size 0 Requested pages mapped into the process' address space 112069 Requested pages found in the cache (99%) 2 Requested pages not found in the cache 668 Pages created in the cache 2 Pages read into the cache 9 Pages written from the cache to the backing file Pool File: reqResult.bdb 4096 Page size 0 Requested pages mapped into the process' address space 84952 Requested pages found in the cache (99%) 2 Requested pages not found in the cache 28 Pages created in the cache 2 Pages read into the cache 3 Pages written from the cache to the backing file Pool File: entryCSN.bdb 4096 Page size 0 Requested pages mapped into the process' address space 52002 Requested pages found in the cache (99%) 2 Requested pages not found in the cache 37 Pages created in the cache 2 Pages read into the cache 4 Pages written from the cache to the backing file Pool File: dn2id.bdb 4096 Page size 0 Requested pages mapped into the process' address space 126243 Requested pages found in the cache (99%) 2 Requested pages not found in the cache 493 Pages created in the cache 2 Pages read into the cache 8 Pages written from the cache to the backing file

According to the output most of the data is in cache. It took about 5 minutes for the consumer to get the updates. I still don't know why.

Sam

Quanah Gibson-Mount

9:14 p.m.

--On May 9, 2007 2:51:23 PM -0400 Sam Tran stlist@gmail.com wrote:

...

On 5/9/07, Quanah Gibson-Mount quanah@zimbra.com wrote:

...
--On May 9, 2007 12:46:18 PM -0400 Sam Tran stlist@gmail.com wrote:

...
On 5/8/07, Quanah Gibson-Mount quanah@zimbra.com wrote: [snip]

...
How have you tuned DB_CONFIG for the accesslog backend?

--Quanah

Here is my DB_CONFIG for the accesslog backend:

set_cachesize 0 52428800 0 set_lg_regionmax 1048576 set_lg_max 10485760 set_lg_bsize 2097152 set_lg_dir /var/log/openldap_bdb/accesslog set_tmp_dir /tmp

Have you looked at the output of db_stat -m to see how things are behaving?

I ran 1,000 sequential writes on the provider. Then run the command you specified. Here is the output:

According to the output most of the data is in cache. It took about 5 minutes for the consumer to get the updates. I still don't know why.

Well, things seem fine from the accesslog DB side. On my servers @ Stanford, the largest latency I've seen is about 7 seconds. I wonder if something is slow on your network.

--Quanah

-- Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration

Sam Tran

10:52 p.m.

On 5/9/07, Quanah Gibson-Mount quanah@zimbra.com wrote:

...

--On May 9, 2007 2:51:23 PM -0400 Sam Tran stlist@gmail.com wrote:

...
On 5/9/07, Quanah Gibson-Mount quanah@zimbra.com wrote:

...
--On May 9, 2007 12:46:18 PM -0400 Sam Tran stlist@gmail.com wrote:

...
On 5/8/07, Quanah Gibson-Mount quanah@zimbra.com wrote: [snip]

...
How have you tuned DB_CONFIG for the accesslog backend?

--Quanah

Here is my DB_CONFIG for the accesslog backend:

set_cachesize 0 52428800 0 set_lg_regionmax 1048576 set_lg_max 10485760 set_lg_bsize 2097152 set_lg_dir /var/log/openldap_bdb/accesslog set_tmp_dir /tmp

Have you looked at the output of db_stat -m to see how things are behaving?

I ran 1,000 sequential writes on the provider. Then run the command you specified. Here is the output:

According to the output most of the data is in cache. It took about 5 minutes for the consumer to get the updates. I still don't know why.

Well, things seem fine from the accesslog DB side. On my servers @ Stanford, the largest latency I've seen is about 7 seconds. I wonder if something is slow on your network.

I don't think the network speed is the problem here. I am running a host-based firewall on the consumer server. I only allow traffic on port 636 between the consumer and the provider. Do I need to open other holes for delta-syncrepl to work?

Thanks, Sam

Sam Tran

10 May 10 May

3:26 a.m.

On 5/9/07, Quanah Gibson-Mount quanah@zimbra.com wrote: [snip]

...

Well, things seem fine from the accesslog DB side. On my servers @ Stanford, the largest latency I've seen is about 7 seconds. I wonder if something is slow on your network.

--Quanah

I set the loglevel to -1 on the consumer and ran further test:

The provider was idle for about 1 hour 45 min. At 20:29:50 I ran a batch of 100 sequential writes on the provider. Nothing happened on the consumer until 20:45:03. Here are the logs from the consumer:

May 9 20:40:04 info-ldap-006 slapd[32288]: daemon: epoll: listen=7 active_threads=0 tvp=zero May 9 20:40:04 info-ldap-006 slapd[32288]: daemon: epoll: listen=8 active_threads=0 tvp=zero May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: activity on 1 descriptor May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: activity on: May 9 20:45:03 info-ldap-006 slapd[32288]: 13r May 9 20:45:03 info-ldap-006 slapd[32288]: May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: read active on 13 May 9 20:45:03 info-ldap-006 slapd[32288]: connection_get(13) May 9 20:45:03 info-ldap-006 slapd[32288]: connection_get(13): got connid=0 May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: epoll: listen=7 active_threads=0 tvp=zero May 9 20:45:03 info-ldap-006 slapd[32288]: =>do_syncrepl rid 006 May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: epoll: listen=8 active_threads=0 tvp=zero May 9 20:45:03 info-ldap-006 slapd[32288]: =>do_syncrep2 rid 006 May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: activity on 1 descriptor May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: activity on: May 9 20:45:03 info-ldap-006 slapd[32288]: do_syncrep2: rid 006 Can't contact LDAP server May 9 20:45:03 info-ldap-006 slapd[32288]: May 9 20:45:03 info-ldap-006 slapd[32288]: connection_get(13) May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: hangup on 13 May 9 20:45:03 info-ldap-006 slapd[32288]: connection_get(13): got connid=0 May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: epoll: listen=7 active_threads=0 tvp=zero May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: removing 13 May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: epoll: listen=8 active_threads=0 tvp=zero May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: activity on 1 descriptor May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: activity on: May 9 20:45:03 info-ldap-006 slapd[32288]: May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: epoll: listen=7 active_threads=0 tvp=zero May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: epoll: listen=8 active_threads=0 tvp=zero May 9 20:45:03 info-ldap-006 slapd[32288]: do_syncrepl: rid 006 retrying May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: activity on 1 descriptor May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: activity on: May 9 20:45:03 info-ldap-006 slapd[32288]: May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: epoll: listen=7 active_threads=0 tvp=zero May 9 20:45:03 info-ldap-006 slapd[32288]: daemon: epoll: listen=8 active_threads=0 tvp=zero May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: epoll: listen=7 active_threads=0 tvp=zero May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: epoll: listen=8 active_threads=0 tvp=zero May 9 20:45:08 info-ldap-006 slapd[32288]: =>do_syncrepl rid 006 May 9 20:45:08 info-ldap-006 slapd[32288]: =>do_syncrep2 rid 006 May 9 20:45:08 info-ldap-006 slapd[32288]: connection_get(13) May 9 20:45:08 info-ldap-006 slapd[32288]: connection_get(13): got connid=0 May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: added 13r listener=(nil) May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: activity on 1 descriptor May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: activity on: May 9 20:45:08 info-ldap-006 slapd[32288]: May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: epoll: listen=7 active_threads=0 tvp=zero May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: epoll: listen=8 active_threads=0 tvp=zero May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: activity on 1 descriptor May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: activity on: May 9 20:45:08 info-ldap-006 slapd[32288]: 13r May 9 20:45:08 info-ldap-006 slapd[32288]: May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: read active on 13 May 9 20:45:08 info-ldap-006 slapd[32288]: connection_get(13) May 9 20:45:08 info-ldap-006 slapd[32288]: connection_get(13): got connid=0 May 9 20:45:08 info-ldap-006 slapd[32288]: =>do_syncrepl rid 006 May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: epoll: listen=7 active_threads=0 tvp=zero May 9 20:45:08 info-ldap-006 slapd[32288]: daemon: epoll: listen=8 active_threads=0 tvp=zero May 9 20:45:08 info-ldap-006 slapd[32288]: =>do_syncrep2 rid 006 May 9 20:45:08 info-ldap-006 slapd[32288]: >>> dnPrettyNormal: <uid=chill,ou=People,dc=example,dc=com>

It looks like the consumer didn't detect that its previous connection to the provider had closed. It seems like the FW between the consumer and the provider is killing the connection without sending a TCP Reset to both sides. I definitely need to talk to our network guys about that.

Now my question is what triggered the 'do_syncrepl ' at 20:45:03? Why didn't the consumer perform the 'do_syncrepl' earlier and realize that the connection to the provider was lost?

Thanks, Sam

Sam Tran

11 May 11 May

7:40 p.m.

On 5/9/07, Sam Tran stlist@gmail.com wrote:

...

It looks like the consumer didn't detect that its previous connection to the provider had closed. It seems like the FW between the consumer and the provider is killing the connection without sending a TCP Reset to both sides. I definitely need to talk to our network guys about that.

Here is what I found out: When the idle connection (no write on the provider) reaches the FW TCP idle timeout threshold, the FW sends out an RST rather than a FIN to close the connection. That is the reason why the consumer doesn't know about the connection loss and doesn't try to reconnect right after, resulting a in a latency.

I may have to use syncrepl in RefreshOnly mode or to make sure that the connection doesn't get idle for too long.

Thanks for the pointers. Sam

Sam Tran

9 May 9 May

6:48 p.m.

On 5/8/07, Sam Tran stlist@gmail.com wrote: [snip]

...

I've been running some LDAP writes (batches of 10 to 10000 sequential writes) on the provider to test the replication on the consumer. Sometimes I observed some latency up to 35 sec. on the consumer with the following messages in the logs: May 8 16:52:08 info-ldap-006 slapd[24480]: do_syncrep2: rid 006 Can't contact LDAP server May 8 16:52:08 info-ldap-006 slapd[24480]: do_syncrepl: rid 006 retrying

Some other times, there was not replication at all until I restarted the consumer.

Am I missing something in the configuration? Please could you shed some light on the matter?

Thanks in advance. Sam

Some additional information:

The provider is in the internal network while the consumer is in a separated DMZ. A static IP mapping for the provider and the appropriate FW rule are necessary to allow the consumer to reach the provider.

Sam

6628

Age (days ago)

6630

Last active (days ago)

openldap-software@openldap.org

9 comments

2 participants

tags (0)

participants (2)

Quanah Gibson-Mount
Sam Tran