I am running OpenLDAP 2.3.39 (locally built) on Red Hat Enterprise Linux 4 servers with several replicas. We use delta-syncrepl to keep the replicas in sync with the master server.
We also use nagios and monitor the contextcsn value on the replica and alert if it gets too far out of sync with the master server.
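The check itself is nothing fancy; it boils down to comparing the output of something like the following on the master and on each replica (the host names here are only examples, the suffix is ours):

    # read the contextCSN of the database on a given server
    ldapsearch -x -LLL -s base -b "dc=uvm,dc=edu" \
        -H ldap://master.example.edu contextCSN
    ldapsearch -x -LLL -s base -b "dc=uvm,dc=edu" \
        -H ldap://replica1.example.edu contextCSN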
The issue we have now experienced a few times is that if there are a LOT of updates in the nightly batch update process, not all of the updates make it to the replicas but the contextcsn stays in sync, so we see strange errors that eventually lead us to discover that the replicas are not current even though they think they are.
Is this a known issue? I haven't found a syslog entry on the server or the replicas that points to the root cause.
I have downloaded and built the 2.3.43 release and installed it on one replica. That replica is just as out of date this morning as the others -- so, if there was a fix between 2.3.39 and 2.3.43, it must have been on the provider side, not the consumer side.
Thanks for any insight.
--On Friday, January 30, 2009 9:45 AM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
I am running OpenLDAP 2.3.39 (locally built) on Red Hat Enterprise Linux 4 servers with several replicas. We use delta-syncrepl to keep the replicas in sync with the master server.
We also use nagios and monitor the contextcsn value on the replica and alert if it gets too far out of sync with the master server.
The issue we have now experienced a few times is that if there are a LOT of updates in the nightly batch update process, not all of the updates make it to the replicas but the contextcsn stays in sync, so we see strange errors that eventually lead us to discover that the replicas are not current even though they think they are.
Is this a known issue? I haven't found a syslog entry on the server or the replicas that points to the root cause.
I have downloaded and built the 2.3.43 release and installed it on one replica. That replica is just as out of date this morning as the others -- so, if there was a fix between 2.3.39 and 2.3.43, it must have been on the provider side, not the consumer side.
Thanks for any insight.
Yes, this was a known issue, and both the replicas and the master must all be on 2.3.43, and all the replicas reloaded from a fresh dump of the master. Once you've done that, it should take care of the problems you are seeing.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
On 1/30/09 1:09 PM, Quanah Gibson-Mount wrote:
--On Friday, January 30, 2009 9:45 AM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
I am running OpenLDAP 2.3.39 (locally built) on Red Hat Enterprise Linux 4 servers with several replicas. We use delta-syncrepl to keep the replicas in sync with the master server.
The issue we have now experienced a few times is that if there are a LOT of updates in the nightly batch update process, not all of the updates make it to the replicas but the contextcsn stays in sync, so we see strange errors that eventually lead us to discover that the replicas are not current even though they think they are.
Yes, this was a known issue, and both the replicas and the master must all be on 2.3.43, and all the replicas reloaded from a fresh dump of the master. Once you've done that, it should take care of the problems you are seeing.
Yes, one of my co-workers calls reloading a fresh dump from the master "nuke and repave" -- and (sadly) we're getting good at it. Which brings up another question: back in the days of slurpd, we could force a replica to accept a change by using the correct DN to send the change it had missed. Is there any way to do something similar using syncrepl (or delta-syncrepl)? I think the answer is no -- but as long as I'm making a nuisance of myself, I figure I might as well ask.
Thanks, Frank
--On Friday, January 30, 2009 1:55 PM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
Yes, one of my co-workers calls reloading a fresh dump from the master "nuke and repave" -- and (sadly) we're getting good at it. Which brings up another question: back in the days of slurpd, we could force a replica to accept a change by using the correct DN to send the change it had missed. Is there any way to do something similar using syncrepl (or delta-syncrepl)? I think the answer is no -- but as long as I'm making a nuisance of myself, I figure I might as well ask.
I'd think you could use the -c option to slapd to give the replica a really old cookie and force it to fall back to doing a full syncrepl refresh, but that's going to take a lot longer than slapcat/slapadd.
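Something along these lines, i.e. start the consumer once with a deliberately stale cookie (the rid has to match the one in that replica's syncrepl stanza, and the CSN below is just an arbitrarily old value in the 2.3 CSN format):

    # force the consumer to treat itself as far behind the provider
    slapd -h ldap:/// -f /etc/openldap/slapd.conf \
        -c "rid=001,csn=20080101000000Z#000000#00#000000"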
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
On 1/30/09 1:09 PM, Quanah Gibson-Mount wrote:
--On Friday, January 30, 2009 9:45 AM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
I am running OpenLDAP 2.3.39 (locally built) on Red Hat Enterprise Linux 4 servers with several replicas. We use delta-syncrepl to keep the replicas in sync with the master server.
Yes, this was a known issue, and both the replicas and the master must all be on 2.3.43, and all the replicas reloaded from a fresh dump of the master. Once you've done that, it should take care of the problems you are seeing.
I upgraded everything to 2.3.43 on January 31 and verified at the time that everything was in sync. However, it has been a week or more since I last audited whether the replicas and the master were in sync. This morning, I discovered that the replicas are again out of date with the master copy, and of course there are no syncrepl_message_to_op failures logged on any of the replicas that would indicate why these 36 modifications were not replicated.
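For the record, the consumer-side check on each replica is nothing more than a grep through the syslog output for the delta-syncrepl apply routine, looking for anything that reports a failure (the log file name is local to our setup):

    # any failed delta-syncrepl apply attempts on this consumer?
    grep syncrepl_message_to_op /var/log/slapd.log | grep -i fail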
So, what ITS was this supposed to be fixed by?
Is there anything that should be logged that would help identify the failure (I'm currently using loglevel of "stats sync" on the master and all the replicas) ?
Thanks,
On 2/18/09 9:14 AM, Francis Swasey wrote:
On 1/30/09 1:09 PM, Quanah Gibson-Mount wrote:
--On Friday, January 30, 2009 9:45 AM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
I am running OpenLDAP 2.3.39 (locally built) on Red Hat Enterprise Linux 4 servers with several replicas. We use delta-syncrepl to keep the replicas in sync with the master server.
Yes, this was a known issue, and both the replicas and the master must all be on 2.3.43, and all the replicas reloaded from a fresh dump of the master. Once you've done that, it should take care of the problems you are seeing.
I upgraded everything to 2.3.43 on January 31 and verified at the time that everything was in sync. However, it has been a week or more since I last audited whether the replicas and the master were in sync. This morning, I discovered that the replicas are again out of date with the master copy, and of course there are no syncrepl_message_to_op failures logged on any of the replicas that would indicate why these 36 modifications were not replicated.
So, what ITS was this supposed to be fixed by?
Is there anything that should be logged that would help identify the failure (I'm currently using loglevel of "stats sync" on the master and all the replicas) ?
Some further digging into this shows that the changes made this morning to at least one of these entries are not present in the accesslog database. No wonder the change didn't make it to the replicas: it didn't even make it into the accesslog on the master (although auditlog sees the change, and the dc=uvm,dc=edu database on the master has the change).
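The check I'm doing is a search like this against the accesslog suffix on the master (the reqDN below is a placeholder for one of the affected entries):

    # was a write against this entry ever logged for replication?
    ldapsearch -x -LLL -H ldap://master.example.edu -b cn=accesslog \
        "(&(objectClass=auditWriteObject)(reqDN=uid=someuser,ou=People,dc=uvm,dc=edu))" \
        reqStart reqType reqResult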
Any suggestions on where to look in the accesslog overlay to see why these modify operations are not being recorded?
Thanks,
--On Wednesday, February 18, 2009 9:55 AM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
Is there anything that should be logged that would help identify the failure (I'm currently using loglevel of "stats sync" on the master and all the replicas) ?
Some further digging into this shows that the changes made this morning to at least one of these entries are not present in the accesslog database. No wonder the change didn't make it to the replicas: it didn't even make it into the accesslog on the master (although auditlog sees the change, and the dc=uvm,dc=edu database on the master has the change).
Any suggestions on where to look in the accesslog overlay to see why these modify operations are not being recorded?
Well, I'd generally check db_stat -c to make sure you didn't run out of locks/lockers/lock objects in the accesslog DB to start. I assume you have reasonable logging on the master, so that you can see if any errors were thrown when the MODs that didn't get written out occurred, etc.
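For example, something like this against the accesslog environment (the directory is whatever your accesslog database directory happens to be), comparing the maximum-used counts it reports against the configured maximums:

    # BDB lock region statistics for the accesslog environment
    db_stat -c -h /var/lib/ldap/accesslog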
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
On 2/18/09 12:07 PM, Quanah Gibson-Mount wrote:
--On Wednesday, February 18, 2009 9:55 AM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
Is there anything that should be logged that would help identify the failure (I'm currently using loglevel of "stats sync" on the master and all the replicas) ?
Some further digging into this shows that the changes made this morning to at least one of these entries are not present in the accesslog database. No wonder the change didn't make it to the replicas: it didn't even make it into the accesslog on the master (although auditlog sees the change, and the dc=uvm,dc=edu database on the master has the change).
Any suggestions on where to look in the accesslog overlay to see why these modify operations are not being recorded?
Well, I'd generally check db_stat -c to make sure you didn't run out of locks/lockers/lock objects in the accesslog DB to start. I assume you have reasonable logging on the master, so that you can see if any errors were thrown when the MODs that didn't get written out occurred, etc.
The limits on locks, lockers, and lock objects are all still at the default of 1000. The maximum numbers used are 54, 124, and 40 (respectively). So, I think I'm safe there.
As I said, I'm logging stats and sync. The modify that didn't make it into the accesslog happened at 1234942530 (2:35:30 EST this morning). The only "interesting" thing logged around that time was:
connection_input: conn=139425 deferring operation: too many executing
which happened at 2:34:58, 2:35:01, 2:35:04, 2:35:09, 2:35:11, 2:35:12, 2:35:16, 2:35:21, 2:35:27, 2:35:31, and 2:35:35. (a total of 11 times)
conn 139425 was the ldapmodify command, connected from 2:34:57 until 2:35:38, that was performing the 1321 changes (559 adds, 1 delete, and 761 modifications).
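All of the above came out of a grep through the syslog output for that connection (the file name is local to our setup):

    # everything slapd logged for that connection: accept, deferrals, ops, close
    grep 'conn=139425 ' /var/log/slapd.log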
Assuming my loglevel is high enough to catch the problem -- that looks like it.
On 2/18/09 1:55 PM, Francis Swasey wrote:
On 2/18/09 12:07 PM, Quanah Gibson-Mount wrote:
--On Wednesday, February 18, 2009 9:55 AM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
Is there anything that should be logged that would help identify the failure (I'm currently using loglevel of "stats sync" on the master and all the replicas) ?
Some further digging into this shows that the changes made this morning to at least one of these entries are not present in the accesslog database. No wonder the change didn't make it to the replicas: it didn't even make it into the accesslog on the master (although auditlog sees the change, and the dc=uvm,dc=edu database on the master has the change).
Any suggestions on where to look in the accesslog overlay to see why these modify operations are not being recorded?
Well, I'd generally check db_stat -c to make sure you didn't run out of locks/lockers/lock objects in the accesslog DB to start. I assume you have reasonable logging on the master, so that you can see if any errors were thrown when the MODs that didn't get written out occurred, etc.
The limits on locks, lockers, and lock objects are all still at the default of 1000. The maximum numbers used are 54, 124, and 40 (respectively). So, I think I'm safe there.
As I said, I'm logging stats and sync. The modify that didn't make it into the accesslog happened at 1234942530 (2:35:30 EST this morning). The only "interesting" thing logged around that time was:
connection_input: conn=139425 deferring operation: too many executing
which happened at 2:34:58, 2:35:01, 2:35:04, 2:35:09, 2:35:11, 2:35:12, 2:35:16, 2:35:21, 2:35:27, 2:35:31, and 2:35:35. (a total of 11 times)
conn 139425 was the ldapmodify command, connected from 2:34:57 until 2:35:38, that was performing the 1321 changes (559 adds, 1 delete, and 761 modifications).
Assuming my loglevel is high enough to catch the problem -- that looks like it.
The deferring operation messages do not seem to be related. There were several of those same messages this morning, but my audit of all the replicas this morning shows that none of them are missing any information.
Any suggestions about where to look for more information or a different loglevel to use on the master to catch this?
I have duplicated the configuration of the master system on another machine and, re-running the 1321 modifications from yesterday morning, I am unable to reproduce the failure to add entries to the accesslog database.
So, as I feared, this problem requires an interaction between this modify and some other sequence of changes. The most likely culprit in my environment is our mail gateways, which store records in our ldap server to block delivery from certain "high volume" systems... It is quite conceivable that a large influx of changes from those systems arrived at the same time.
The other major difference was that I started this test with the accesslog database empty. I find that a "slapcat" of the cn=accesslog database doesn't include the DNs, so slapadd can't load the entries back in. Hmm... and after doing the test on the other machine, the slapcat (on that machine) does include the DN value... Perhaps I have found my problem.
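For reference, the dump I'm comparing on both machines is produced the same way (the config path is ours):

    # dump the accesslog database to LDIF
    slapcat -f /etc/openldap/slapd.conf -b cn=accesslog -l accesslog.ldif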
I guess I'm going to shut down the master and wipe the accesslog database tonight, and see if this happens again :(
This problem has come back. Yesterday, one of my auxiliary feeds messed up and zapped 18020 entries, removing a certain data value. I put the value back, but in performing an audit this morning, I discovered that 67 of the repairs had not made it to the replicas.
I have now repaired the 67 by reversing the change and then re-applying it on the master server.
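The repair is just a pair of ldapmodify operations per entry, roughly like this (the DN, attribute, and value below are placeholders for the real data):

    # repair.ldif -- back the value out, then put it back, as two separate modify ops
    dn: uid=someuser,ou=People,dc=uvm,dc=edu
    changetype: modify
    delete: departmentNumber
    departmentNumber: 12345
    -

    dn: uid=someuser,ou=People,dc=uvm,dc=edu
    changetype: modify
    add: departmentNumber
    departmentNumber: 12345
    -

applied on the master with something like "ldapmodify -x -D "cn=manager,dc=uvm,dc=edu" -W -f repair.ldif".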
I assume this is some kind of timing bug. What info can I provide to help find this?
I am running 2.3.43 on Red Hat Enterprise Linux v4.
Thanks,
Francis Swasey wrote:
This problem has come back. Yesterday, one of my auxiliary feeds messed up and zapped 18020 entries, removing a certain data value. I put the value back, but in performing an audit this morning, I discovered that 67 of the repairs had not made it to the replicas.
I have now repaired the 67 by reversing the change and then re-applying it on the master server.
I assume this is some kind of timing bug. What info can I provide to help find this?
I am running 2.3.43 on Red Hat Enterprise Linux v4.
At this point, not much until you upgrade to 2.4. 2.3 hasn't been updated in 8 months and is about to be retired.
On 3/3/09 2:50 PM, Howard Chu wrote:
Francis Swasey wrote:
I assume this is some kind of timing bug. What info can I provide to help find this?
I am running 2.3.43 on Red Hat Enterprise Linux v4.
At this point, not much until you upgrade to 2.4. 2.3 hasn't been updated in 8 months and is about to be retired.
You may consider the 2.4 branch stable -- but I'm still seeing way too many problems being reported to consider any release of 2.4 for production work.
If this is the worst thing I can find wrong with 2.3.43, then I'll stay on 2.3 until the major issues with 2.4 go away.
Francis Swasey wrote:
On 3/3/09 2:50 PM, Howard Chu wrote:
Francis Swasey wrote:
I assume this is some kind of timing bug. What info can I provide to help find this?
I am running 2.3.43 on Red Hat Enterprise Linux v4.
At this point, not much until you upgrade to 2.4. 2.3 hasn't been updated in 8 months and is about to be retired.
You may consider the 2.4 branch stable -- but I'm still seeing way too many problems being reported to consider any release of 2.4 for production work.
Well, the number of problems being reported might be higher for 2.4, but that does not say anything about the stability of your particular configuration. One of the reasons more issues are being reported for 2.4 is probably that many deployments have already been migrated to 2.4. So naturally, if fewer people are using 2.3, fewer issues will be reported.
If this is the worst thing I can find wrong with 2.3.43, then I'll stay on 2.3 until the major issues with 2.4 go away.
Hmm, personally I wouldn't want to correct entries manually... your mileage may vary.
Ciao, Michael.
--On Wednesday, March 04, 2009 8:24 PM +0100 Michael Ströder michael@stroeder.com wrote:
You may consider the 2.4 branch stable -- but I'm still seeing way too many problems being reported to consider any release of 2.4 for production work.
Well, the number of problems being reported might be higher for 2.4, but that does not say anything about the stability of your particular configuration. One of the reasons more issues are being reported for 2.4 is probably that many deployments have already been migrated to 2.4. So naturally, if fewer people are using 2.3, fewer issues will be reported.
I would say I consider the current 2.4.15 more stable than 2.3.43 when using the same feature set, in particular because of numerous fixes for issues that are still present in 2.3.43. I of course have no idea whether or not any of those fixes addresses your particular issue. I haven't seen this, and all my installs use delta-syncrepl. I'm wondering, however, if it could be related to ITS#5985. How many replicas do you have in total? If there are ones in particular that fall way behind, do they end up falling back into full refresh mode?
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
On 3/4/09 3:01 PM, Quanah Gibson-Mount wrote:
--On Wednesday, March 04, 2009 8:24 PM +0100 Michael Ströder michael@stroeder.com wrote:
You may consider the 2.4 branch stable -- but I'm still seeing way too many problems being reported to consider any release of 2.4 for production work.
Well, the number of problems being reported might be higher for 2.4, but that does not say anything about the stability of your particular configuration. One of the reasons more issues are being reported for 2.4 is probably that many deployments have already been migrated to 2.4. So naturally, if fewer people are using 2.3, fewer issues will be reported.
I would say I consider the current 2.4.15 more stable than 2.3.43 when using the same feature set, in particular because of numerous fixes for issues that are still present in 2.3.43. I of course have no idea whether or not any of those fixes addresses your particular issue. I haven't seen this, and all my installs use delta-syncrepl. I'm wondering, however, if it could be related to ITS#5985. How many replicas do you have in total? If there are ones in particular that fall way behind, do they end up falling back into full refresh mode?
I have a total of ten replica systems. All ten will wind up missing the same updates. They don't realize that they have missed them. Their contextcsn syncs up with the one on the master server.
I believe the changes are not being placed in the delta database (cn=accesslog).
I'll be sure to check the accesslog database the next time the replicas miss changes and see if those changes are listed there or not.
On 3/4/09 9:31 PM, Francis Swasey wrote:
I believe the changes are not being placed in the delta database (cn=accesslog).
I'll be sure to check the accesslog database the next time the replicas miss changes and see if those changes are listed there or not.
This hypothesis has been confirmed with the modifications that didn't make it to the replicas yesterday morning and this morning.
Looks like the race conditions in accesslog/syncrepl that are being worked on in the ITS's against 2.4 were present in the later 2.3 systems as well.
--On Friday, March 06, 2009 7:30 AM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
This hypothesis has been confirmed with the modifications that didn't make it to the replicas yesterday morning and this morning.
Looks like the race conditions in accesslog/syncrepl that are being worked on in the ITS's against 2.4 were present in the later 2.3 systems as well.
I'm not sure what race conditions you're referring to here. Using accesslog serializes writes, therefore you can't hit the race conditions about updating the CSN that are being worked on in current RE24.
I'll note that my 2.3.43 builds are actually a hybrid with some 2.4 code. I don't believe there's anything accesslog specific in it, but there is some connection code reworking that is a part of it (the lightweight dispatcher from 2.4, most notably). I've yet to see this with any of our deployments and some of them are quite large and extremely busy.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
On 3/9/09 2:57 PM, Quanah Gibson-Mount wrote:
--On Friday, March 06, 2009 7:30 AM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
This hypothesis has been confirmed with the modifications that didn't make it to the replicas yesterday morning and this morning.
Looks like the race conditions in accesslog/syncrepl that are being worked on in the ITS's against 2.4 were present in the later 2.3 systems as well.
I'm not sure what race conditions you're referring to here. Using accesslog serializes writes, therefore you can't hit the race conditions about updating the CSN that are being worked on in current RE24.
I'll note that my 2.3.43 builds are actually a hybrid with some 2.4 code. I don't believe there's anything accesslog specific in it, but there is some connection code reworking that is a part of it (the lightweight dispatcher from 2.4, most notably). I've yet to see this with any of our deployments and some of them are quite large and extremely busy.
To be specific, there are changes that make it into the master server, and the auditlog overlay logs them, but the accesslog overlay does NOT put them in the accesslog database, so they do not get sent to the replica servers.
It seems to be some kind of race condition. I haven't figured out a way to reproduce the failure yet.
--On Tuesday, March 10, 2009 10:49 AM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
On 3/9/09 2:57 PM, Quanah Gibson-Mount wrote:
--On Friday, March 06, 2009 7:30 AM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
This hypothesis has been confirmed with the modifications that didn't make it to the replicas yesterday morning and this morning.
Looks like the race conditions in accesslog/syncrepl that are being worked on in the ITS's against 2.4 were present in the later 2.3 systems as well.
I'm not sure what race conditions you're referring to here. Using accesslog serializes writes, therefore you can't hit the race conditions about updating the CSN that are being worked on in current RE24.
I'll note that my 2.3.43 builds are actually a hybrid with some 2.4 code. I don't believe there's anything accesslog specific in it, but there is some connection code reworking that is a part of it (the lightweight dispatcher from 2.4, most notably). I've yet to see this with any of our deployments and some of them are quite large and extremely busy.
To be specific, there are changes that make it into the master server, and the auditlog overlay logs them, but the accesslog overlay does NOT put them in the accesslog database, so they do not get sent to the replica servers.
It seems to be some kind of race condition. I haven't figured out a way to reproduce the failure yet.
Oh, you have auditlog in place too? I don't believe you mentioned that before. I bet it is related to them both being enabled.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
On 3/10/09 12:20 PM, Quanah Gibson-Mount wrote:
--On Tuesday, March 10, 2009 10:49 AM -0400 Francis Swasey
To be specific, there are changes that make it into the master server, and the auditlog overlay logs them, but the accesslog overlay does NOT put them in the accesslog database, so they do not get sent to the replica servers.
It seems to be some kind of race condition. I haven't figured out a way to reproduce the failure yet.
Oh, you have auditlog in place too? I don't believe you mentioned that before. I bet it is related to them both being enabled.
It is the first time I've mentioned it in this thread, but I've mentioned it in previous threads.
So -- why would having auditlog and accesslog (and syncprov) all used with a database cause accesslog to miss some of the changes?
--On Tuesday, March 10, 2009 1:08 PM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
On 3/10/09 12:20 PM, Quanah Gibson-Mount wrote:
--On Tuesday, March 10, 2009 10:49 AM -0400 Francis Swasey
To be specific, there are changes that make it into the master server, and the auditlog overlay logs them, but the accesslog overlay does NOT put them in the accesslog database, so they do not get sent to the replica servers.
It seems to be some kind of race condition. I haven't figured out a way to reproduce the failure yet.
Oh, you have auditlog in place too? I don't believe you mentioned that before. I bet it is related to them both being enabled.
It is the first time I've mentioned it in this thread, but I've mentioned it in previous threads.
So -- why would having auditlog and accesslog (and syncprov) all used with a database cause accesslog to miss some of the changes?
I don't know. :/ But I do know I've never seen it in very write-intensive environments, and that is a major difference.
Do writes go to auditlog before accesslog in your configuration? Maybe there's a bug in auditlog under a high write load where the changes don't get cleaned up properly. If auditlog is being written first, try reversing the order so that accesslog gets the changes first (I assume that's possible).
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
On 3/10/09 1:19 PM, Quanah Gibson-Mount wrote:
--On Tuesday, March 10, 2009 1:08 PM -0400 Francis Swasey Frank.Swasey@uvm.edu wrote:
On 3/10/09 12:20 PM, Quanah Gibson-Mount wrote:
--On Tuesday, March 10, 2009 10:49 AM -0400 Francis Swasey
To be specific, there are changes that make it into the master server, and the auditlog overlay logs them, but the accesslog overlay does NOT put them in the accesslog database, so they do not get sent to the replica servers.
It seems to be some kind of race condition. I haven't figured out a way to reproduce the failure yet.
Oh, you have auditlog in place too? I don't believe you mentioned that before. I bet it is related to them both being enabled.
It is the first time I've mentioned it in this thread, but I've mentioned it in previous threads.
So -- why would having auditlog and accesslog (and syncprov) all used with a database cause accesslog to miss some of the changes?
I don't know. :/ But I do know I've never seen it in very write-intensive environments, and that is a major difference.
Do writes go to auditlog before accesslog in your configuration? Maybe there's a bug in auditlog under a high write load where the changes don't get cleaned up properly. If auditlog is being written first, try reversing the order so that accesslog gets the changes first (I assume that's possible).
If I remember correctly, the overlays get control in the reverse order of their listing (i.e., the last one listed is the first one executed), so yes: auditlog is run first, then accesslog, then syncprov.
I'll do some experimenting with the order and see if I can list auditlog first so it runs last.
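Roughly, that would turn the overlay section of the provider's config into something like this (trimmed down to the ordering; the file names and the rest of the accesslog/syncprov settings are placeholders for what we actually run):

    database    bdb
    suffix      "dc=uvm,dc=edu"

    # listed first, so it runs last
    overlay     auditlog
    auditlog    /var/log/ldap/audit.ldif

    overlay     syncprov

    # listed last, so it sees each write first
    overlay     accesslog
    logdb       cn=accesslog
    logops      writes
    logsuccess  TRUE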
On 3/10/09 2:17 PM, Francis Swasey wrote:
If I remember correctly, the overlays get control in the reverse order of their listing (i.e., the last one listed is the first one executed), so yes: auditlog is run first, then accesslog, then syncprov.
I'll do some experimenting with the order and see if I can list auditlog first so it runs last.
I had a chance last night to put the re-ordered overlay list into production, with auditlog listed first, syncprov second, and accesslog last.
This morning, my replicas were out of sync again, so I have now restarted the master server without the auditlog overlay (I didn't even load the module). If my replicas go out of sync again, then there is definitely a bug in accesslog -- and since the developers are still fighting with replication bugs in 2.4.15, I don't see much relief coming any time soon.
Perhaps, someone should undertake to re-implement slurpd -- maybe based on auditlog ;-)
On 3/20/09 9:50 AM, Aaron Richton wrote:
On Fri, 20 Mar 2009, Francis Swasey wrote:
This morning, my replicas were out of sync again, so I have now restarted
Just out of curiosity, what determines "out of sync" in your environment? I assume you're seeing something more insidious than a simple contextCSN mismatch.
I'm seeing contextCSN values that match. I do a slapcat on the replica, compare that to a slapcat from the master, and find changes that are missing. I then search the accesslog db (matching the reqDN value) and find that those changes were never recorded there on the master server.
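Concretely, the audit amounts to something like this (the paths are just where I happen to put the dumps):

    # on the master and on each replica, after the nightly load has settled
    slapcat -f /etc/openldap/slapd.conf -b "dc=uvm,dc=edu" -l /tmp/dump.ldif

    # then, with the dumps pulled back to one place
    diff master.ldif replica1.ldif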
Quanah's last thought was that there was some conflict between the auditlog and accesslog overlays, so I'm hoping to confirm or refute that by having completely removed the auditlog overlay from the environment today.
Sadly, I can't reproduce the issue at will -- I have to wait for the right sequence of gamma and other radiation to coincidentally happen during a time when there's a high rate of changes (the nightly batch update from the authoritative sources).
Francis Swasey wrote:
On 3/10/09 2:17 PM, Francis Swasey wrote:
If I remember correctly, the overlays get control in the reverse order of their listing (i.e., the last one listed is the first one executed), so yes: auditlog is run first, then accesslog, then syncprov.
I'll do some experimenting with the order and see if I can list auditlog first so it runs last.
I had a chance last night to put the re-ordered overlay list into production, with auditlog listed first, syncprov second, and accesslog last.
This morning, my replicas were out of sync again, so I have now restarted the master server without the auditlog overlay (I didn't even load the module). If my replicas go out of sync again, then there is definitely a bug in accesslog -- and since the developers are still fighting with replication bugs in 2.4.15, I don't see much relief coming any time soon.
Those are related to multimaster. Single master delta-sync has not shown any problems in 2.4.
Francis Swasey wrote:
On 3/10/09 12:20 PM, Quanah Gibson-Mount wrote:
--On Tuesday, March 10, 2009 10:49 AM -0400 Francis Swasey
To be specific, there are changes that make it into the master server, and the auditlog overlay logs them, but the accesslog overlay does NOT put them in the accesslog database, so they do not get sent to the replica servers.
It seems to be some kind of race condition. I haven't figured out a way to reproduce the failure yet.
Oh, you have auditlog in place too? I don't believe you mentioned that before. I bet it is related to them both being enabled.
It is the first time I've mentioned it in this thread, but I've mentioned it in previous threads.
So -- why would having auditlog and accesslog (and syncprov) all used with a database cause accesslog to miss some of the changes?
Why not just have another process suck out the accesslog contents you're interested in auditing and write them to a log file? Having both overlays is kinda redundant.
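E.g. a nightly job that pulls the previous day's successful write operations out of accesslog and archives them -- something along the lines of the following (the host, time window, and output path are only illustrative):

    # dump successful write ops logged since the given reqStart time
    ldapsearch -x -LLL -H ldap://master.example.edu -b cn=accesslog \
        "(&(objectClass=auditWriteObject)(reqResult=0)(reqStart>=20090310000000Z))" \
        > /var/log/ldap-audit/20090310.ldif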
--On Wednesday, March 11, 2009 10:30 AM -0700 Russell Jackson raj@csub.edu wrote:
Why not just have another process suck out the accesslog contents you're interested in auditing and write them to a log file? Having both overlays is kinda redundant.
Regardless, they should still play nice together. ;)
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration