I have a master and nine delta-syncrepl replicas all running on RHEL 4 using OpenLDAP 2.3.38 on the master and all replicas.
The master is configured to purge entries from the accesslog after 12 hours (checking every 2 hours). I forget where I picked that value up or why I used it.
Anyway, I have discovered that if one of the replicas gets more than 12 hours behind (i.e., it is shut down, or we reload its database from yesterday's backup), it will grab all the updates that have happened in the last 12 hours, and then the CSNs of the replica and the master will agree -- but the changes that happened more than 12 hours ago (beyond the accesslog window) are not present on the replica.
Is this a configuration mistake I have made (other than setting the accesslog purge time to 12 hours), or is this a limitation (that I likely knew at one time, forgot, and have now relearned the hard way)?
Is there an option I can set on the replicas so they will refuse to start if their CSN is older than the oldest record in the master's accesslog when they start up?
If this has been discussed in the past, I'm sorry -- I did search, but since I wasn't certain which terms to use, I didn't have any luck finding anything that looked promising.
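I don't see such an option documented; the closest I can come up with is a wrapper around slapd startup that compares the consumer's contextCSN timestamp against the oldest entryCSN still present in the master's accesslog, and refuses to start if the consumer has fallen outside the window. A rough sketch only -- the suffix, the provider URL, the slapd invocation, and the assumption that cn=accesslog is readable by this client are all placeholders for a real deployment:

```shell
#!/bin/sh
# Hypothetical pre-start guard for a delta-syncrepl consumer.
# The suffix, provider URL, and slapcat/ldapsearch details are
# placeholders -- adjust (and add bind credentials) before real use.

# Extract the YYYYMMDDHHMMSSZ timestamp portion of a CSN value.
csn_ts() {
    printf '%s\n' "$1" | cut -d'#' -f1
}

# True (exit 0) if CSN $1 is strictly older than CSN $2.
# The timestamp prefixes sort correctly as plain strings.
csn_older_than() {
    a=$(csn_ts "$1")
    b=$(csn_ts "$2")
    [ "$a" != "$b" ] && \
        [ "$(printf '%s\n%s\n' "$a" "$b" | sort | head -1)" = "$a" ]
}

if [ "$1" = "--check" ]; then
    # Consumer's contextCSN, read from the local (stopped) database.
    local_csn=$(slapcat -b "dc=example,dc=com" \
        | sed -n 's/^contextCSN: //p' | head -1)
    # Oldest entryCSN still present in the master's accesslog.
    oldest_log=$(ldapsearch -LLL -x -H ldaps://ldaprw.example.com \
        -b cn=accesslog '(objectClass=auditWriteObject)' entryCSN \
        | sed -n 's/^entryCSN: //p' | sort | head -1)
    if csn_older_than "$local_csn" "$oldest_log"; then
        echo "Refusing to start: consumer CSN $local_csn is older than" \
             "the oldest accesslog record ($oldest_log)" >&2
        exit 1
    fi
    exec /usr/sbin/slapd   # start normally if we're inside the window
fi
```

The string comparison works because CSN timestamps are fixed-width and sort chronologically as text, so no date parsing is needed.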
Here are the relevant parts of the master's slapd.conf:
database hdb
suffix cn=accesslog
directory /var/lib/ldap/accesslog
rootdn cn=accesslog
checkpoint 1024 5
index default eq
index entryCSN,objectClass,reqEnd,reqResult,reqStart
overlay syncprov
syncprov-nopresent TRUE
syncprov-reloadhint TRUE
database bdb
suffix dc=example,dc=com
...
overlay syncprov
syncprov-checkpoint 1000 60
overlay accesslog
logdb cn=accesslog
logops writes
logsuccess TRUE
logpurge 12:00 02:00
And from the replicas':
database bdb
suffix dc=example,dc=com
...
syncrepl rid=100
  provider=ldaps://ldaprw.example.com
  bindmethod=simple
  binddn="cn=MySyncUser,dc=example,dc=com"
  credentials=NotMyRealPassword
  searchbase="dc=example,dc=com"
  logbase="cn=accesslog"
  logfilter="(&(objectclass=auditWriteObject)(reqResult=0))"
  schemachecking=on
  type=refreshAndPersist
  retry=30,+
  syncdata=accesslog
Thanks,
--On Wednesday, October 17, 2007 3:58 PM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
I have a master and nine delta-syncrepl replicas all running on RHEL 4 using OpenLDAP 2.3.38 on the master and all replicas.
The master is configured to purge entries from the accesslog after 12 hours (checking every 2 hours). I forget where I picked that value up or why I used it.
Yeah, that seems a little short... The shortest I've dropped a master to is 4 days. You may have discovered an undocumented feature :P
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
--On Wednesday, October 17, 2007 3:58 PM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
I have a master and nine delta-syncrepl replicas all running on RHEL 4 using OpenLDAP 2.3.38 on the master and all replicas.
The master is configured to purge entries from the accesslog after 12 hours (checking every 2 hours). I forget where I picked that value up or why I used it.
Yeah, that seems a little short... The shortest I've dropped a master to is 4 days. You may have discovered an undocumented feature :P
Not a feature. When the replica is out of date with the log, it's supposed to get a reloadHint from the master and fall back to plain syncrepl against the main DB, all automatically. If that isn't happening, then there's a bug.
I'm sorry I haven't gotten back to you on this before now.
I've done some testing and I need a pointer or two to see if I've got something wonky or a bug.
On my master server, the accesslog database's contextCSN has the value of when slapd was started, even though the accesslog no longer contains data that far back.
Here's LDIF for the accesslog suffix:
dn: cn=accesslog
objectClass: auditContainer
cn: accesslog
entryCSN: 20070710160157Z#000000#00#000000
structuralObjectClass: auditContainer
contextCSN: 20071010225200Z#000000#00#000000
and the oldest information left in the accesslog db has an entryCSN value of 20071022225634Z#000000#00#000000.
First, shouldn't the accesslog db contextCSN be updated to indicate the oldest information still present?
Second, does the syncprov overlay look at the contextCSN of the accesslog db or of the main db? My assumption is it is checking the accesslog db (and hence, doesn't think it needs to send a hint because the accesslog db contextCSN is much older than the contextCSN from the dc=example,dc=com I slapadded on the replica).
Third, I set up a test replica using loglevel any and I do not see any indication that any hint was provided. I've never gone looking for one before -- so, I'm not certain what I'm looking for. However, I see indication that syncrepl initialized and then started feeding in the updates from the accesslog on the master. Did I miss it, is it not logged, or am I just blind again?
Finally, looking at the accesslog.c overlay source, I see where the contextCSN is set from the "main DB" during the original open. However, I have not found any indication that the contextCSN is updated during subsequent logpurge actions (which may just mean I haven't found the code that performs the purge yet). Still, given that the current contextCSN in the accesslog db matches the last time slapd was started, while the oldest data actually present is from just over 12 hours ago, I'm suspicious.
Thanks,
Francis Swasey wrote:
I'm sorry I haven't gotten back to you on this before now.
I've done some testing and I need a pointer or two to see if I've got something wonky or a bug.
On my master server, the accesslog database's contextCSN has the value of when slapd was started, even though the accesslog no longer contains data that far back.
Here's LDIF for the accesslog suffix:
dn: cn=accesslog
objectClass: auditContainer
cn: accesslog
entryCSN: 20070710160157Z#000000#00#000000
structuralObjectClass: auditContainer
contextCSN: 20071010225200Z#000000#00#000000
and the oldest information left in the accesslog db has an entryCSN value of 20071022225634Z#000000#00#000000.
First, shouldn't the accesslog db contextCSN be updated to indicate the oldest information still present?
No. The contextCSN always indicates the *newest* information.
Second, does the syncprov overlay look at the contextCSN of the accesslog db or of the main db? My assumption is it is checking the accesslog db (and hence, doesn't think it needs to send a hint because the accesslog db contextCSN is much older than the contextCSN from the dc=example,dc=com I slapadded on the replica).
There is no "*the* syncprov overlay" here. There should be an instance on the main db and another instance on the log db. Each one manages its own respective database. If you only have one syncprov configured, then that's the cause of your problem.
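A provider layout with two syncprov instances, along the lines Howard describes, might look roughly like this -- a sketch assembled from the configs posted earlier in the thread; suffixes and values are illustrative:

```
# Main database: the accesslog overlay records writes, and this
# database's own syncprov instance serves sessions against the main DB.
database bdb
suffix dc=example,dc=com
...
overlay syncprov
syncprov-checkpoint 1000 60
overlay accesslog
logdb cn=accesslog
logops writes
logsuccess TRUE
logpurge 12:00 02:00

# Log database: a second, separate syncprov instance serves the delta log.
database hdb
suffix cn=accesslog
directory /var/lib/ldap/accesslog
rootdn cn=accesslog
overlay syncprov
syncprov-nopresent TRUE
syncprov-reloadhint TRUE
```

Each syncprov instance manages only the database it is stacked on, which is why both stanzas need one.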
On 10/23/07 5:15 PM, Howard Chu wrote:
Francis Swasey wrote:
I'm sorry I haven't gotten back to you on this before now.
I've done some testing and I need a pointer or two to see if I've got something wonky or a bug.
On my master server, the accesslog database's contextCSN has the value of when slapd was started, even though the accesslog no longer contains data that far back.
Here's LDIF for the accesslog suffix:
dn: cn=accesslog
objectClass: auditContainer
cn: accesslog
entryCSN: 20070710160157Z#000000#00#000000
structuralObjectClass: auditContainer
contextCSN: 20071010225200Z#000000#00#000000
and the oldest information left in the accesslog db has an entryCSN value of 20071022225634Z#000000#00#000000.
First, shouldn't the accesslog db contextCSN be updated to indicate the oldest information still present?
No. The contextCSN always indicates the *newest* information.
No. The contextCSN will only change when the accesslog overlay is opened or whenever syncprov checkpoints (and since I forgot to specify a checkpoint value for the syncprov overlay in the accesslog db, that's why that contextCSN was not getting updated).
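For reference, adding the checkpoint Francis mentions to the accesslog database's syncprov instance would look something like this (values are illustrative; syncprov-checkpoint takes <ops> <minutes>):

```
database hdb
suffix cn=accesslog
...
overlay syncprov
syncprov-nopresent TRUE
syncprov-reloadhint TRUE
syncprov-checkpoint 100 10
```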
Second, does the syncprov overlay look at the contextCSN of the accesslog db or of the main db? My assumption is it is checking the accesslog db (and hence, doesn't think it needs to send a hint because the accesslog db contextCSN is much older than the contextCSN from the dc=example,dc=com I slapadded on the replica).
There is no "*the* syncprov overlay" here. There should be an instance on the main db and another instance on the log db. Each one manages its own respective database. If you only have one syncprov configured, then that's the cause of your problem.
I have the syncprov overlay configured (likely misconfigured) on both the accesslog db and the "main" db. I have updated the configuration to specify the checkpoint control for syncprov on the accesslog db, and have now specified the syncprov-reloadhint TRUE option in the main db as well. I am still not seeing any sign in the syslog output that a hint goes to the replica when I load a three-day-old copy of the main db and tell it to syncrepl against my 12-hour accesslog.
--On October 24, 2007 3:22:21 PM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
Please post your configs.
--Quanah
On 10/24/07 4:36 PM, Quanah Gibson-Mount wrote:
--On October 24, 2007 3:22:21 PM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
Please post your configs.
They have changed very little since the first posting. Here they are, though (as attachments).
Francis Swasey wrote:
On 10/24/07 4:36 PM, Quanah Gibson-Mount wrote:
--On October 24, 2007 3:22:21 PM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
Please post your configs.
They have changed very little since the first posting. Here they are, though (as attachments).
Your "retry" setting looks strange. Can you fire up slapd with -d 64 (config file processing) and check:
"If an error occurs during replication, the consumer will attempt to reconnect according to the retry parameter which is a list of the <retry interval> and <# of retries> pairs. For example, retry="60 10 300 3" lets the consumer retry every 60 seconds for the first 10 times and then retry every 300 seconds for the next 3 times before stop retrying. The ‘+’ in <# of retries> means indefinite number of retries until success."
Yours is:
retry=30,+
--On Thursday, October 25, 2007 10:23 PM +0100 Gavin Henry ghenry@OpenLDAP.org wrote:
retry=30,+
Yeah, that isn't valid. It'd have to be:
retry="30 +"
--Quanah
Quanah Gibson-Mount wrote:
--On Thursday, October 25, 2007 10:23 PM +0100 Gavin Henry ghenry@OpenLDAP.org wrote:
retry=30,+
Yeah, that isn't valid. It'd have to be:
retry="30 +"
No wonder they all get out of sync then... strange.
--Quanah
--On Thursday, October 25, 2007 2:40 PM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
--On Thursday, October 25, 2007 10:23 PM +0100 Gavin Henry ghenry@OpenLDAP.org wrote:
retry=30,+
Yeah, that isn't valid. It'd have to be:
retry="30 +"
Never mind, Howard says the parser accepts comma, space, or tab. So it is perfectly fine, even though the docs only reference space.
--Quanah
--On Thursday, October 25, 2007 11:05 AM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
For my config, for the master accesslog DB portion, I have:
checkpoint <size> <time>
configured.
I do not have the:
syncprov-checkpoint
configured.

For both database definitions, I would move your limits portion up so that the overlay XYZ bits are the very last things to be listed, as is advised with all 2.3 releases.
On your replica, I would relocate the updateref bit before the overlay statement. Overlay statements should always be the last part of a database definition in 2.3.
--Quanah
On 10/25/07 6:01 PM, Quanah Gibson-Mount wrote:
--On Thursday, October 25, 2007 11:05 AM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
For my config, for the master accesslog DB portion, I have:
checkpoint <size> <time>
configured.
I do not have the:
syncprov-checkpoint
configured.
You don't? Does your accesslog DB keep its contextCSN updated while the server runs, or does it only update the contextCSN when you start slapd?
For both database definitions, I would move your limits portion up so that the overlay XYZ bits are the very last things to be listed, as is advised with all 2.3 releases.
Ok... perhaps that little tidbit could be made more prominent (instead of being a note in the last sentence of the overlay statement description in the slapd.conf man page). To be honest, I have been reading the slapo-XYZ man pages rather than the slapd.conf man page for the overlay settings, and I had not noticed that note before going back to check just before sending this message.
On your replica, I would relocate the updateref bit before the overlay statement. Overlay statements should always be the last part of a database definition in 2.3.
I have made the changes you suggested and I will let it run overnight and see if the changes that were made to the master between the time of the slapcat at 7pm on the 21st and 10am this morning will get syncrepl'd over to the replica or not.
Thanks for the guidance on the slapd.conf file.
--On Thursday, October 25, 2007 9:59 PM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
You don't? Does your accesslog DB keep its contextCSN updated while the server runs, or does it only update the contextCSN when you start slapd?
It updates the contextCSN every time a modification is made. The syncprov checkpoint has nothing to do with that behavior.
Fresh DB with no changes:
[zimbra@freelancer libexec]$ ldapsearch -LLL -x -D "uid=zimbra,cn=admins,cn=zimbra" -h freelancer -W -b "" -s base contextCSN
Enter LDAP Password:
dn:
contextCSN: 20071027180930Z#000001#00#000000
[zimbra@freelancer libexec]$ ldapsearch -LLL -x -D "uid=zimbra,cn=admins,cn=zimbra" -h freelancer -W -b "cn=accesslog" -s base contextCSN
Enter LDAP Password:
dn: cn=accesslog
contextCSN: 20071027180930Z#000000#00#000000
As you can see, the contextCSNs match.
Now, I add a user.
[zimbra@freelancer libexec]$ ldapsearch -LLL -x -D "uid=zimbra,cn=admins,cn=zimbra" -h freelancer -W -b "" -s base contextCSN
Enter LDAP Password:
dn:
contextCSN: 20071027181210Z#000000#00#000000
[zimbra@freelancer libexec]$ ldapsearch -LLL -x -D "uid=zimbra,cn=admins,cn=zimbra" -h freelancer -W -b "cn=accesslog" -s base contextCSN
Enter LDAP Password:
dn: cn=accesslog
contextCSN: 20071027181210Z#000000#00#000000
As you can see, the contextCSNs still match.
And, here's the contextCSN on the replica:
[zimbra@tribes ~]$ ldapsearch -LLL -x -D "uid=zimbra,cn=admins,cn=zimbra" -h freelancer -W -b "" -s base contextCSN
Enter LDAP Password:
dn:
contextCSN: 20071027181210Z#000000#00#000000
Everything's happy.
--Quanah
--On Saturday, October 27, 2007 11:14 AM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
Everything's happy.
Two other notes.
(a) After stopping, then restarting, all CSNs are in sync on the master:
[zimbra@freelancer libexec]$ ldapsearch -LLL -x -D "uid=zimbra,cn=admins,cn=zimbra" -h freelancer -W -b "cn=accesslog" -s base contextCSN
Enter LDAP Password:
dn: cn=accesslog
contextCSN: 20071027181210Z#000000#00#000000

[zimbra@freelancer libexec]$ ldapsearch -LLL -x -D "uid=zimbra,cn=admins,cn=zimbra" -h freelancer -W -b "cn=accesslog" -s base contextCSN
Enter LDAP Password:
dn: cn=accesslog
contextCSN: 20071027181210Z#000000#00#000000
and note that neither is the startup time; both are the time of the last modification, as it should be.
(b) I enable the syncprov overlay on my replicas.
--Quanah
On 10/27/07 2:21 PM, Quanah Gibson-Mount wrote:
--On Saturday, October 27, 2007 11:14 AM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
Everything's happy.
Two other notes.
(a) Stopping, then restarting, all CSN's are on sync in the master:
<snip>
and note that neither is the startup time, both are the time of the last modification, as it should be.
/sigh -- I am seeing the same thing now. The only thought I have about why the accesslog db contextCSN was so out of date when I first posted is that the statement ordering in the slapd.conf (which you helped me fix) was causing a problem.
I also assumed it was related to startup time, because it had the date and rough timestamp of when slapd had last been started. However, since there is a branch of the tree that takes a ton of writes (sendmail spam prevention) (/bigsigh), that was a bad assumption to state as fact. What I really should have said is that the accesslog db appeared to have the contextCSN that was set when the slapd process was last started on the master server -- not that the contextCSN was the start time. It was 2007101022xxxxZ -- and I have tossed that LDIF file, so I can't be any surer than that. The important thing is that slapd had been restarted the evening of 20071010, and the accesslog db contextCSN was way out of date when I checked on the 22nd.
I have left my hourly checking of the contextCSN values running all weekend and the contextCSN of the main db and the accesslog db have stayed in sync.
(b) I enable the syncprov overlay on my replicas.
Is this required, or just to allow you to use them as syncrepl providers in the future? Do you also run the accesslog overlay on your replicas or just the syncprov overlay in the main db?
(c) I am still seeing no hint activity. Unless someone has a way for me to catch it with a loglevel, I will have to dig into the source and sort this out (which, with my other duties, makes this week a real drag).
--On Monday, October 29, 2007 10:17 AM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
(b) I enable the syncprov overlay on my replicas.
Is this required, or just to allow you to use them as syncrepl providers in the future? Do you also run the accesslog overlay on your replicas or just the syncprov overlay in the main db?
Just the syncprov overlay in the main db. At the moment I don't recall why I added it, but it was for a useful purpose. ;)
--Quanah
--On Monday, October 29, 2007 4:45 PM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
--On Monday, October 29, 2007 10:17 AM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
Hi Frank,
To follow up on this some more. Last week, I set up a master with an initial db that had a single user. I configured the accesslog DB to check every 30 minutes, and delete all entries older than 2 hours. I then slapcat'd that initial DB.
Then I added a second user.
Then I waited for several hours, and verified the accesslog DB was empty.
Then I added a third user.
Then I set up a replica and imported the original DB with a single user into it. After starting it, it caught up to the master, and had all 3 users present. So, I was not able to reproduce your issue, and I was able to verify that at least in this rather simple scenario that:
(a) Entries in the accesslog DB were replicated.
(b) Entries not in the accesslog DB that had been created after the data export was taken were still properly added to the replica.
Hope that helps, Quanah
--On November 5, 2007 9:11:01 PM -0800 Quanah Gibson-Mount quanah@zimbra.com wrote:
--On Monday, October 29, 2007 4:45 PM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
--On Monday, October 29, 2007 10:17 AM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
Hi Frank,
To follow up on this some more.
And some more... We hit this at a customer site with 2.3.37. So, now it's been reproduced. However, it looks like the issue is ITS#5177, which was fixed in OpenLDAP 2.3.39.
--Quanah
On 11/27/07 11:18 AM, Quanah Gibson-Mount wrote:
--On November 5, 2007 9:11:01 PM -0800 Quanah Gibson-Mount quanah@zimbra.com wrote:
--On Monday, October 29, 2007 4:45 PM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
--On Monday, October 29, 2007 10:17 AM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
Hi Frank,
To follow up on this some more.
And some more... We hit this at a customer site with 2.3.37. So, now it's been reproduced. However, it looks like the issue is ITS#5177, which was fixed in OpenLDAP 2.3.39.
Ok -- I'm at 2.3.39 now on both the consumer and the producer. When I get a moment to breathe, I'll see if I can reproduce it now (I still could when my producer was 2.3.37 and my consumer was 2.3.39).
I'll look at the ITS too.
Thanks!
Frank
Francis Swasey wrote:
On 11/27/07 11:18 AM, Quanah Gibson-Mount wrote:
--On November 5, 2007 9:11:01 PM -0800 Quanah Gibson-Mount quanah@zimbra.com wrote:
--On Monday, October 29, 2007 4:45 PM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
--On Monday, October 29, 2007 10:17 AM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
Hi Frank,
To follow up on this some more.
And some more... We hit this at a customer site with 2.3.37. So, now it's been reproduced. However, it looks like the issue is ITS#5177, which was fixed in OpenLDAP 2.3.39.
Ok -- I'm at 2.3.39 now on both the consumer and the producer. When I get a moment to breathe, I'll see if I can reproduce it now (I still could when my producer was 2.3.37 and my consumer was 2.3.39).
The relevant fix was on the provider.
I'll look at the ITS too.