Hi
Some time ago we switched our MMR config to "logsuccess FALSE" to let us record and investigate some (client-side) issues with unique constraint.
This records updates refused by unique constraint in the accesslog, so we can look into them. We have logfilter="(&(objectClass=auditWriteObject)(reqResult=0))" in each syncrepl statement, so those failed updates don't propagate through the replication.
We noticed though that other failed updates, not due to unique constraint, such as adding an already existing object, or deleting a non-existing one, do cause a contextCSN update *on all replica's* but *not* in the provider, leading to contextCSN divergence - at least when no further updates happen on that provider. This usually remains hidden as other updates to the same provider keep increasing its contextCSN, but becomes apparant when no further updates happen on that provider (in this case because we were switching providers).
Eg. when adding an object that already exists, the provider refused the update (error 68, already exists) and logs the following to the accesslog:
dn: reqStart=20220616095229.000003Z,cn=accesslog objectClass: auditAdd reqStart: 20220616095229.000003Z reqEnd: 20220616095229.000004Z reqType: add reqSession: 1498 reqAuthzID: <...> reqDN: <some already existing object> reqResult: 68 reqMod: <...> reqMod: entryCSN:+ 20220616095229.618035Z#000000#001#000000
and its main db is not updated.
But the replica's log the following:
dn: reqStart=20220616095229.000006Z,cn=accesslog objectClass: auditModify reqStart: 20220616095229.000006Z reqEnd: 20220616095229.000007Z reqType: modify reqSession: 1 reqAuthzID: <...> reqDN: <root DN> reqResult: 0 reqMod: contextCSN:= 20220616095229.618035Z#000000#001#000000
And thus update the contextCSN of their root DN (despite no other changes to the main db) - their contextCSN is now different than the provider's.
This obviously only happens with "logsuccess FALSE", but even in that case this should not be expected behaviour; refused updates should not change anything except getting logged to the accesslog?
This is with OpenLDAP 2.5.12 (migrating from 2.4.x)
Geert
On Thu, Jun 16, 2022 at 12:28:27PM +0200, Geert Hendrickx wrote:
And thus update the contextCSN of their root DN (despite no other changes to the main db) - their contextCSN is now different than the provider's.
This obviously only happens with "logsuccess FALSE", but even in that case this should not be expected behaviour; refused updates should not change anything except getting logged to the accesslog?
This is with OpenLDAP 2.5.12 (migrating from 2.4.x)
Hi Geert, this sounds a lot like ITS#9641. Could you also check 2.6?
Thanks,
On Thu, Jun 16, 2022 at 12:42:46 +0200, Ondřej Kuzník wrote:
On Thu, Jun 16, 2022 at 12:28:27PM +0200, Geert Hendrickx wrote:
And thus update the contextCSN of their root DN (despite no other changes to the main db) - their contextCSN is now different than the provider's.
This obviously only happens with "logsuccess FALSE", but even in that case this should not be expected behaviour; refused updates should not change anything except getting logged to the accesslog?
This is with OpenLDAP 2.5.12 (migrating from 2.4.x)
Hi Geert, this sounds a lot like ITS#9641. Could you also check 2.6?
Thanks Ondřej, yes that ITS describes the same issue. I can still reproduce it with 2.6.2 as well as master.
Geert
On Thu, Jun 16, 2022 at 01:29:55PM +0200, Geert Hendrickx wrote:
On Thu, Jun 16, 2022 at 12:42:46 +0200, Ondřej Kuzník wrote:
this sounds a lot like ITS#9641. Could you also check 2.6?
Thanks Ondřej, yes that ITS describes the same issue. I can still reproduce it with 2.6.2 as well as master.
Hi Geert, you mean the CSN recorded (a.k.a. "entryCSN:", not "reqMod: entryCSN:=") for these failed ops is still the original one rather than a new local SID? That would mean the ITS#9641 fix is not working and if that's the case, can you provide a test set up that reproduces this?
Thanks,
On Thu, Jun 16, 2022 at 13:50:52 +0200, Ondřej Kuzník wrote:
Hi Geert, you mean the CSN recorded (a.k.a. "entryCSN:", not "reqMod: entryCSN:=") for these failed ops is still the original one rather than a new local SID? That would mean the ITS#9641 fix is not working and if that's the case, can you provide a test set up that reproduces this?
Yes, we see the replica's get an updated contextCSN with SID from the provider, only the provider itself makes no change to its contextCSN and stays with the old one, ie. the consumers appear to be ahead of this provider. So for me, the behaviour of 2.6 and master is still the same as 2.4/2.5.
Regarding your fix, you mean the replica (*each* replica?) will increase the contextCSN for its own SID for this operation? That seems strange to me, for a failed/refused update on another provider, I would expect no change at all on the replica's (and esp. the contextCSN associated with their SID's to never change as long as they receive no updates directly).
The reproduction scenario is simply: MPR setup with "logsuccess FALSE", then ldapdelete an object which does not exist. => contextCSN on the replica's get updated (there is not even an entryCSN involved here).
Geert
On Thu, Jun 16, 2022 at 03:03:53PM +0200, Geert Hendrickx wrote:
Yes, we see the replica's get an updated contextCSN with SID from the provider, only the provider itself makes no change to its contextCSN and stays with the old one, ie. the consumers appear to be ahead of this provider. So for me, the behaviour of 2.6 and master is still the same as 2.4/2.5.
Regarding your fix, you mean the replica (*each* replica?) will increase the contextCSN for its own SID for this operation? That seems strange to me, for a failed/refused update on another provider, I would expect no change at all on the replica's (and esp. the contextCSN associated with their SID's to never change as long as they receive no updates directly).
Yes, exactly. The accesslog is a DB like any other and you've chosen to add the entries that need an entryCSN of their own. Don't see a way out of that? On a pure replica it's just not involved in replication, on a provider, the accesslog shouldn't stay ahead for long? The NEW_COOKIE message should loop back to it AFAIK, but would need to test that.
Regards,
On Thu, Jun 16, 2022 at 15:20:39 +0200, Ondřej Kuzník wrote:
On Thu, Jun 16, 2022 at 03:03:53PM +0200, Geert Hendrickx wrote:
Yes, we see the replica's get an updated contextCSN with SID from the provider, only the provider itself makes no change to its contextCSN and stays with the old one, ie. the consumers appear to be ahead of this provider. So for me, the behaviour of 2.6 and master is still the same as 2.4/2.5.
Regarding your fix, you mean the replica (*each* replica?) will increase the contextCSN for its own SID for this operation? That seems strange to me, for a failed/refused update on another provider, I would expect no change at all on the replica's (and esp. the contextCSN associated with their SID's to never change as long as they receive no updates directly).
Yes, exactly. The accesslog is a DB like any other and you've chosen to add the entries that need an entryCSN of their own. Don't see a way out of that? On a pure replica it's just not involved in replication, on a provider, the accesslog shouldn't stay ahead for long? The NEW_COOKIE message should loop back to it AFAIK, but would need to test that.
No, I'm speaking of the ContextCSN of the *main DIT*, not of the accesslog. A no-op like deleting a non-existing object updates the ContextCSN of the main db on the replica's but not on the provider. We monitor (the diff between) the contextCSN's so it's indicates the provider is "behind" the consumers here.
Geert
On Thu, Jun 16, 2022 at 03:52:32PM +0200, Geert Hendrickx wrote:
On Thu, Jun 16, 2022 at 15:20:39 +0200, Ondřej Kuzník wrote:
Yes, exactly. The accesslog is a DB like any other and you've chosen to add the entries that need an entryCSN of their own. Don't see a way out of that? On a pure replica it's just not involved in replication, on a provider, the accesslog shouldn't stay ahead for long? The NEW_COOKIE message should loop back to it AFAIK, but would need to test that.
No, I'm speaking of the ContextCSN of the *main DIT*, not of the accesslog. A no-op like deleting a non-existing object updates the ContextCSN of the main db on the replica's but not on the provider. We monitor (the diff between) the contextCSN's so it's indicates the provider is "behind" the consumers here.
I have tried reproducing this and the only DB that has contextCSN changes is the accesslog DB, the main DB stays intact. Please provide an example setup with 2.6 or master where you're seeing the above behaviour. Are you sure it's not something like ppolicy that makes changes to the local DB instead?
Thanks,
On Fri, Jun 17, 2022 at 13:10:57 +0200, Ondřej Kuzník wrote:
I have tried reproducing this and the only DB that has contextCSN changes is the accesslog DB, the main DB stays intact. Please provide an example setup with 2.6 or master where you're seeing the above behaviour. Are you sure it's not something like ppolicy that makes changes to the local DB instead?
I provided a minimal config + reproduction scenario to Ondřej off-list.
Hope he can reproduce it now (or I am misunderstanding something)
Geert
openldap-technical@openldap.org