--On Thursday, June 29, 2017 1:41 PM -0400 btb btb@bitrate.net wrote:
On 6/29/17 11:15 AM, Quanah Gibson-Mount wrote:
--On Thursday, June 29, 2017 2:12 AM -0400 btb btb@bitrate.net wrote:
i see, thanks. i tested this, and did a modify on each, but didn't see replication resume. emulating the syncrepl connection with a manual search against each master, there do seem to be accesslog entries now, on both masters:
You may have to restart the consumers (I did when I ran into this).
i did try a restart on both, but they returned to the same state
Also, there are 2 sets of CSNs per master that you need to examine -- The CSNs in your database root (i.e., dc=example,dc=org) and your accesslog root.
that would be these, right?
dsa1 cn=accesslog:
20161019002438.652359Z#000000#000#000000
20170521175113.974560Z#000000#002#000000
20170530214415.204052Z#000000#001#000000

dsa1 dc=example,dc=org:
20170520031415.276678Z#000000#000#000000
20170530214231.171959Z#000000#002#000000
20170530214415.204052Z#000000#001#000000

dsa2 cn=accesslog:
20170520031415.276678Z#000000#000#000000
20170521175113.974560Z#000000#002#000000
20170628034119.327974Z#000000#001#000000

dsa2 dc=example,dc=org:
20170520031415.276678Z#000000#000#000000
20170619014933.531051Z#000000#002#000000
20170628034119.327974Z#000000#001#000000
why are there three per db, and which is supposed to match which?
wow, that's a mess.
So #000# is serverID 0, which would be for any entries prior to moving to MMR. The fact that you have different values for #000# on dsa1 accesslog vs the other 3 databases is disturbing.
It would appear DSA1 is serverID 1, and its CSNs make sense:
20170530214415.204052Z#000000#001#000000
20170530214415.204052Z#000000#001#000000
However, there's something seriously wrong with dsa2 (assuming it is serverID 2):
20170521175113.974560Z#000000#002#000000
20170619014933.531051Z#000000#002#000000
This implies the primary DB received a write on 2017/06/19 @ 01:49:33, but the accesslog has not recorded that change: it says the last write op to the accesslog DB for #002# was 2017/05/21 @ 17:51:13, nearly a month earlier. So the accesslog doesn't seem to think you've done a write op directly against serverID 002.
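The comparison above can be sketched in a few lines of shell. This is only an illustration: the helper names csn_field and csn_older are made up for this example, and it assumes the usual CSN layout of timestamp#count#serverID#mod. The two values are the #002# CSNs quoted from dsa2.

```shell
#!/bin/sh
# Hypothetical helpers (not part of OpenLDAP). Assumed CSN layout:
# timestamp#count#serverID#mod

# Print one '#'-separated field of a CSN (1 = timestamp, 3 = serverID).
csn_field() {
    printf '%s\n' "$1" | cut -d '#' -f "$2"
}

# Print the older of two CSNs. The timestamp is fixed-width, so a plain
# byte-wise sort orders CSNs chronologically.
csn_older() {
    printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort | head -n 1
}

# The two #002# values reported for dsa2 in this thread.
accesslog='20170521175113.974560Z#000000#002#000000'
primary='20170619014933.531051Z#000000#002#000000'

if [ "$accesslog" != "$primary" ]; then
    echo "serverID $(csn_field "$primary" 3): older CSN is $(csn_older "$accesslog" "$primary")"
fi
```

When the older value belongs to the accesslog, as here, the accesslog DB has not recorded the latest local write for that serverID.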
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
thanks. i think i've managed to clean up the mess, and replication is flowing again. i've exorcized the old serverid 000 references, and verified each server's accesslog is getting updated as local modifications occur.
contextcsns seem to be a bit more sane now, hopefully?
ldapsearch -ZZxWLLLH 'ldap://dsa1.example.org/' -D 'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b 'cn=config' -s base 'olcserverid'
Enter LDAP Password:
dn: cn=config
olcServerID: 1

ldapsearch -ZZxWLLLH 'ldap://dsa2.example.org/' -D 'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b 'cn=config' -s base 'olcserverid'
Enter LDAP Password:
dn: cn=config
olcServerID: 2

ldapsearch -ZZxWLLLH 'ldap://dsa1.example.org/' -D 'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b 'dc=example,dc=org' -s base 'contextcsn'
Enter LDAP Password:
dn: dc=example,dc=org
contextCSN: 20170705042207.590054Z#000000#001#000000
contextCSN: 20170704183515.872465Z#000000#002#000000

ldapsearch -ZZxWLLLH 'ldap://dsa2.example.org/' -D 'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b 'dc=example,dc=org' -s base 'contextcsn'
Enter LDAP Password:
dn: dc=example,dc=org
contextCSN: 20170705042207.590054Z#000000#001#000000
contextCSN: 20170704183515.872465Z#000000#002#000000

ldapsearch -ZZxWLLLH 'ldap://dsa1.example.org/' -D 'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b 'cn=accesslog' -s base 'contextcsn'
Enter LDAP Password:
dn: cn=accesslog
contextCSN: 20170705042145.957972Z#000000#001#000000
contextCSN: 20170704183515.872465Z#000000#002#000000

ldapsearch -ZZxWLLLH 'ldap://dsa2.example.org/' -D 'uid=dit_admin,ou=role_accounts,ou=accounts,dc=example,dc=org' -b 'cn=accesslog' -s base 'contextcsn'
Enter LDAP Password:
dn: cn=accesslog
contextCSN: 20170705042145.957972Z#000000#001#000000
contextCSN: 20170704183515.872465Z#000000#002#000000
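Checking by eye that both masters report the same contextCSN set gets tedious, so here is a rough sketch of how the comparison could be scripted. The function name context_csns is made up for this example, and the sample input is the dc=example,dc=org output from the two searches above. In practice the input would come from piping ldapsearch -LLL into the function.

```shell
#!/bin/sh
# Hypothetical helper: pull the contextCSN values out of `ldapsearch -LLL`
# output on stdin, sorted so two servers can be compared directly.
context_csns() {
    awk '$1 == "contextCSN:" { print $2 }' | LC_ALL=C sort
}

# Sample output for dc=example,dc=org as posted in this thread.
dsa1_out='dn: dc=example,dc=org
contextCSN: 20170705042207.590054Z#000000#001#000000
contextCSN: 20170704183515.872465Z#000000#002#000000'

dsa2_out='dn: dc=example,dc=org
contextCSN: 20170705042207.590054Z#000000#001#000000
contextCSN: 20170704183515.872465Z#000000#002#000000'

dsa1=$(printf '%s\n' "$dsa1_out" | context_csns)
dsa2=$(printf '%s\n' "$dsa2_out" | context_csns)

if [ "$dsa1" = "$dsa2" ]; then
    echo "contextCSNs match"
else
    echo "contextCSNs differ"
fi
```

The same function works for the cn=accesslog searches. Sorting first means the order in which a server returns the values does not matter.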
i've also increased accesslog data retention from 7 days to 14 days, as a bit of a compensation for the infrequent writes, and i'll implement a "no-op" cron job as well, as a fail-safe. are there any pitfalls i may not be considering with a 14 day accesslog retention period? is that too long according to "typical" consensus?
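One possible shape for the "no-op" cron job mentioned above, as a sketch only. The entry cn=heartbeat,dc=example,dc=org, the function name heartbeat_ldif, and the BIND_DN and PW_FILE variables are all hypothetical; the idea is to rewrite one attribute on a dedicated entry so that each master regularly records a local write.

```shell
#!/bin/sh
# Sketch of a periodic "no-op" write. The DN below is a hypothetical entry
# created for this purpose; it is not from the original setup.
HEARTBEAT_DN='cn=heartbeat,dc=example,dc=org'

# Emit LDIF that replaces the description of the given entry ($1)
# with a marker string containing the given stamp ($2).
heartbeat_ldif() {
    printf 'dn: %s\nchangetype: modify\nreplace: description\ndescription: heartbeat %s\n' \
        "$1" "$2"
}

# Print the LDIF that one run would send.
heartbeat_ldif "$HEARTBEAT_DN" "$(date -u +%Y%m%d%H%M%SZ)"

# From cron on each master, the LDIF would be piped to ldapmodify, e.g.
# (BIND_DN and PW_FILE are assumptions; adjust to the local setup):
#   heartbeat_ldif "$HEARTBEAT_DN" "$(date -u +%Y%m%d%H%M%SZ)" |
#       ldapmodify -ZZ -x -H ldap://dsa1.example.org/ -D "$BIND_DN" -y "$PW_FILE"
```

Running it against each master's own URI, at an interval shorter than the accesslog retention period, should keep a recent local write in each accesslog.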
for posterity's sake, after the mess was cleaned up, once a proper write occurred on each master, and the accesslog db was updated and csns brought in line, replication began flowing again, without the need for a restart on either side [at least in this particular case, anyway].
-ben
openldap-technical@openldap.org