--On Thursday, September 01, 2016 8:05 AM +0000 quanah(a)zimbra.com wrote:
--On Thursday, September 01, 2016 7:52 AM +0000 quanah(a)openldap.org wrote:
> Full_Name: Quanah Gibson-Mount
> Version: OpenLDAP 2.4.44
> OS: Linux 2.6
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (75.111.52.177)
>
>
> In a 2-node MMR setup. Node 1 is getting a lot of write traffic. Both
> node 1 and node 2 have 3 replicas each. At some point, a change is
> received by node 1, which writes the change to its accesslog DB and its
> primary DB. Its 3 replicas are all correctly updated. MMR node 2
> receives the change, updates its primary DB, but *fails* to write the
> change to the accesslog DB. However, it *does* write the CSN update to
> the accesslog DB successfully. This causes all of its replicas to also
> update their CSN. Then a change comes in triggering a constraint
> violation on the replicas, but fully accepted by their master.
So the above summary is incorrect. While 3 replicas did go out of sync,
2 belonged to the primary master (node 1) and 1 belonged to the secondary
master (node 2). So really, 4 systems didn't log the change (MMR node 2,
ldap05, ldap07, ldap09).
Ok, so that's not correct either. I now have the correct topology:
ldap01 has the following replicas: ldap02, ldap05, ldap07, ldap09
ldap02 has the following replicas: ldap01, ldap06, ldap08, ldap10
So the replicas of ldap01 received the change and rejected it. ldap02 just
skipped writing the entry to the accesslog, and as a result, none of its
replicas ever got the change. They never hit the err 19 (constraint
violation) failure, but they are all now missing this modification entirely.
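To make the resulting state easier to follow, here is a rough Python sketch of the layout above and of where the change ended up. It only models what was observed in this report, not how slapd actually propagates changes. The function name `outcomes` and the outcome strings are made up for illustration, and it assumes the change originated on ldap01.

```python
# Provider -> consumers, as listed in the message above.
TOPOLOGY = {
    "ldap01": ["ldap02", "ldap05", "ldap07", "ldap09"],
    "ldap02": ["ldap01", "ldap06", "ldap08", "ldap10"],
}
MASTERS = set(TOPOLOGY)  # the two MMR nodes


def outcomes(origin, skips_accesslog):
    """Return {server: outcome} for one change written at `origin`.

    `skips_accesslog` is the set of masters that apply the change to their
    primary DB but fail to write it to their accesslog DB (hypothetical
    parameter; it encodes the failure reported here, nothing more).
    """
    result = {origin: "originated"}
    for consumer in TOPOLOGY[origin]:
        if consumer not in MASTERS:
            # Observed in this report: the origin's own replicas got the
            # change from its accesslog and rejected it with err 19.
            result[consumer] = "rejected (err 19)"
            continue
        if consumer in skips_accesslog:
            result[consumer] = "applied, not logged"
            downstream_state = "never received"
        else:
            # Not observed here; included only for contrast.
            result[consumer] = "applied and logged"
            downstream_state = "received"
        for downstream in TOPOLOGY[consumer]:
            # Do not overwrite servers that already have an outcome.
            result.setdefault(downstream, downstream_state)
    return result


if __name__ == "__main__":
    for server, state in sorted(outcomes("ldap01", {"ldap02"}).items()):
        print(server, "->", state)
```

With ldap02 in the skip set, ldap05/07/09 come out as rejected and ldap06/08/10 as never having received the change, which matches the description above.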
I would note that every server was loaded today from the same LDAP backup,
so they all started out perfectly in sync.
In looking at the LDAP accesslog, what I see is that what should have been
a modRDN op was stored in the accesslog as a MOD op (the one I noted
before). This seems particularly bizarre, because ldap01 should have
rejected this change as well. It appears we may have a problem where the
accesslog DB is updated, but the change is then rejected by the unique
overlay.
--Quanah
--
Quanah Gibson-Mount