monitoring MMR sync

Dave

25 Dec 2024

8:24 a.m.

Hello Everyone,

Hope you are enjoying the day.

Was curious what people are doing to monitor their MMR clusters being out of sync.

At least the implementation we found and have been trying is using telegraf and the [ldap_org](https://github.com/falon/CSI-telegraf-plugins/blob/master/plugins/inputs/lda...) plugin to gather all objects on each ldap master and compare their count.

How is the community doing it? Any better way to go about this?

Any input is appreciated!

Best, Dave

Dirk-Willem van Gulik

25 Dec

10:43 a.m.

On 25 Dec 2024, at 17:25, Dave davama@gmail.com wrote:

...

Was curious what people are doing to monitor their MMR clusters being out of sync

Have you looked at the change sequence number (CSN)? E.g. either script it and check that they are identical.

Or use net-snmp and expose it that way. That is what I usually do.
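
A minimal sketch of the "script it" option, assuming the contextCSN values of each provider have already been fetched and saved to a file, one value per line. The function name `csn_in_sync` and the file layout are hypothetical and chosen only for illustration.

```shell
#!/bin/bash
# csn_in_sync FILE_A FILE_B   (hypothetical helper, not part of OpenLDAP)
# Each file holds the contextCSN lines of one provider, one per line.
# Prints "in sync" and returns 0 when both providers report the same
# set of CSNs, otherwise prints "out of sync" and returns 1.
csn_in_sync() {
    # The order of the values is not guaranteed, so sort before comparing.
    if [ "$(sort "$1")" = "$(sort "$2")" ]; then
        echo "in sync"
        return 0
    fi
    echo "out of sync"
    return 1
}
```

On a busy directory the values can differ briefly while a change is in flight, so an alert would normally only fire after several consecutive mismatches.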

Dw.

Dave Macias

1:44 p.m.

Interesting! Didn't know about that feature. Thank you for the tip!

Stefan Kania

26 Dec

2:11 a.m.

What about slapd-watch? Did you take a look at it?

sacawulu

31 Dec

1:57 a.m.

Hi,

What we are doing is: generate on each LDAP node two files that "fingerprint" the local LDAP DIT. These are:

- the contextCSNs of the local DB
- an md5sum of a dump of (many) records in the local DB

When replication is doing its thing, the contents of these two fingerprints (files) should be identical across all nodes.

We then use zabbix-agent2 to collect and compare the fingerprints across the nodes. The values of course change often, since LDAP is in active use. But as long as the values match every 10 minutes or so, everything is working and there is nothing to worry about.

This works (for us) surprisingly well. Below are the two bash scripts. You need to define/adjust the variables for your environment, of course.

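script #1: ldap-contextCSN.sh:

<begin>
#!/bin/bash

# October 2023, by MoHe

# works only as root or sudo
if [ "$EUID" -ne 0 ]; then
    echo "Please run as root"
    exit 1
fi

# remove the previous result; -f keeps this quiet when the file does not exist yet
rm -f /tmp/ldap-contextCSN

# LDAPSEARCH, BASE, LDAPBINDDN and ADMINPW must be defined for your environment
$LDAPSEARCH -H ldapi:// -b "$BASE" -D "$LDAPBINDDN" -w "$ADMINPW" -s base contextCSN | grep -v requesting | grep contextCSN > /tmp/ldap-contextCSN
<end>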

script #2: ldap-local.md5sum.sh:

<begin>
#!/bin/bash

# June 2023, by MoHe

# works only as root or sudo
if [ "$EUID" -ne 0 ]; then
    echo "Please run as root"
    exit 1
fi

# set unique tmpdir, so multiple copies of the script can run
TMPDIR="/tmp/ldap-md5sum.tmp.$BASHPID"

# set correct slapcat, symas or 'regular'
if [ -f "/bin/slapcat" ]; then
    SLAPCAT="/bin/slapcat"
elif [ -f "/sbin/slapcat" ]; then
    SLAPCAT="/sbin/slapcat"
elif [ -f "/opt/symas/sbin/slapcat" ]; then
    SLAPCAT="/opt/symas/sbin/slapcat"
else
    echo "Sorry, cannot find slapcat..."
    exit 1
fi

# remove the previous result; -f keeps this quiet when the file does not exist yet
rm -f /tmp/ldap-local.md5sum

mkdir -p "$TMPDIR"

# keep only the dn and entryCSN lines, plus the blank lines that separate entries
"$SLAPCAT" -d 0 | grep -E '^dn:|^entryCSN:|^$' > "$TMPDIR/local.dn"
# making file sortable: join each entry onto a single line...
awk 'BEGIN {RS=""}{gsub(/\n/,"",$0); print $0}' "$TMPDIR/local.dn" > "$TMPDIR/local.dn.sortable"
# and do the sorting...
sort "$TMPDIR/local.dn.sortable" > "$TMPDIR/local.dn.sorted"
# and create the md5sum result for zabbix to read
md5sum < "$TMPDIR/local.dn.sorted" > /tmp/ldap-local.md5sum

rm -r "$TMPDIR"
<end>

We have been doing this for more than a year now, and for us it has been completely reliable.

Of course we're open to feedback, suggestions or improvements.
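
The normalization step of script #2 can be tried without a running server. The sketch below wraps the same grep/awk/sort/md5sum pipeline in a function that reads LDIF on stdin; the function name `fingerprint_ldif` is hypothetical, and the input is assumed to be slapcat-style LDIF with blank lines between entries.

```shell
#!/bin/bash
# fingerprint_ldif   (hypothetical helper)
# Reads LDIF on stdin and prints an md5sum that does not depend on the
# order in which the entries were dumped.
fingerprint_ldif() {
    # 1. keep dn, entryCSN and the blank separator lines
    # 2. paragraph mode (RS="") turns each entry into one record;
    #    removing the newlines puts dn and entryCSN on a single line
    # 3. sort so that dump order does not matter
    # 4. hash the result
    grep -E '^dn:|^entryCSN:|^$' \
        | awk 'BEGIN { RS = "" } { gsub(/\n/, "", $0); print $0 }' \
        | sort \
        | md5sum
}
```

Two nodes that hold the same entries with the same entryCSN values produce the same hash even if slapcat emits the entries in a different order; any differing entryCSN changes the hash.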

Ondřej Kuzník

2 Jan

3:08 a.m.

On Wed, Dec 25, 2024 at 04:24:49PM -0000, Dave wrote:

...

Hello Everyone,

Hope you are enjoying the day.

Was curious what people are doing to monitor their MMR clusters being out of sync.

At least the implementation we found and have been trying is using telegraf and the ldap_org plugin to gather all objects on each ldap master and compare their count.

How is the community doing it? Any better way to go about this?

Hi, I've seen two main approaches, sometimes combined:

1. Tracking contextCSN/cookie:
   a) read out the contextCSN from the DB's top-level entry (poll)
   b) have the server push any cookie changes on-line (push), you also get to discover the provider's serverID this way
2. Read the olmMDBEntries from cn=monitor and make sure those stay in sync

The former gives you more information and is my go-to, but its use in monitoring can be confusing: each serverID CSN has to be compared independently, you cannot do straight time arithmetic for alerting, ...

Some of that is abstracted away by syncmonitor[0] which should be easy to adapt for most monitoring solutions and can even do real-time monitoring + alerting. It is under active development, most recently in the textual branch to expose a TUI frontend and refactor the library to track the cookies on a per-SID basis for real-time replication delay measurement.

Both of them can be, and often are, run in tandem: entry count is useful for catching misconfiguration where ACLs do not give the replication identity the intended permissions (or only for accesslog but not the main DB). In deltasync you also have to make sure the accesslog DB *never* runs out of space; the desyncs that result from such a failure are not recoverable save by identifying a canonical provider and using it to reseed the cluster manually.
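
One way to keep an eye on accesslog DB space is to read the back-mdb page counters from cn=monitor. The sketch below is an assumption-laden illustration: it expects LDIF on stdin containing the olmMDBPagesMax, olmMDBPagesUsed and olmMDBPagesFree attributes of the accesslog database entry, and it treats (used - free) / max as the share of the map in use. The function name `mdb_used_percent` is hypothetical.

```shell
#!/bin/bash
# mdb_used_percent   (hypothetical helper)
# Reads the cn=monitor entry of one MDB database as LDIF on stdin and
# prints the share of the map in use as an integer percentage.
mdb_used_percent() {
    awk '
        /^olmMDBPagesMax:/  { max  = $2 }
        /^olmMDBPagesUsed:/ { used = $2 }
        /^olmMDBPagesFree:/ { free = $2 }
        END {
            # Assumption: pages on the free list can be reused, so they
            # are subtracted from the used count.
            if (max > 0) printf "%d\n", (used - free) * 100 / max
            else         print "unknown"
        }
    '
}
```

The output could be fed to a threshold check in whatever monitoring system is already collecting cn=monitor data.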

[0]. https://git.openldap.org/openldap/syncmonitor

Regards,

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP

Dave Macias

6 Jan

8:41 a.m.

Thank you Ondřej for the reply!

...

Read the olmMDBEntries from cn=monitor and make sure those stay in sync

The olmMDBEntries help me out a lot! Thank you for this pointer. Since we already gather cn=monitor metrics, the data was already there so we just need to adjust our alerting for that.
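
For anyone not yet alerting on this, a small sketch of the comparison, assuming the monitoring system already has the cn=monitor database entry of each provider available as LDIF. The helper names `entry_count` and `counts_match` are hypothetical.

```shell
#!/bin/bash
# entry_count   (hypothetical helper)
# Reads LDIF on stdin and prints the first olmMDBEntries value found.
entry_count() {
    awk '/^olmMDBEntries:/ { print $2; exit }'
}

# counts_match COUNT...   (hypothetical helper)
# Succeeds only when every count passed in is identical.
counts_match() {
    first=$1
    for c in "$@"; do
        [ "$c" = "$first" ] || return 1
    done
    return 0
}
```

As with CSNs, counts can differ for a moment while a change replicates, so comparing over a short window is usually less noisy than alerting on a single mismatch.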

...

The former gives you more information and is my go-to, but its use in monitoring can be confusing: each serverID CSN has to be compared independently, you cannot do straight time arithmetic for alerting, ...

Any tool recommendations for this?

...

Some of that is abstracted away by syncmonitor[0] which should be easy to adapt for most monitoring solutions and can even do real-time monitoring + alerting. It is under active development, most recently in the textual branch to expose a TUI frontend and refactor the library to track the cookies on a per-SID basis for real-time replication delay measurement.

What is not "abstracted away by syncmonitor" ?

Best, Dave

Ondřej Kuzník

7 Jan

1:58 a.m.

On Mon, Jan 06, 2025 at 11:41:26AM -0500, Dave Macias wrote:

...

Thank you Ondřej for the reply!

...

Read the olmMDBEntries from cn=monitor and make sure those stay in sync

The olmMDBEntries help me out a lot! Thank you for this pointer. Since we already gather cn=monitor metrics, the data was already there so we just need to adjust our alerting for that.

...
The former gives you more information and is my go-to, but its use in monitoring can be confusing: each serverID CSN has to be compared independently, you cannot do straight time arithmetic for alerting, ...

Any tool recommendations for this?

In the syncmonitor repository, look at the synccheck tool and you might be able to use its output as input for whatever monitoring/alerting system you need. Or use the library if you want something that keeps running and feeding back status changes.

Let me know if there are any more questions when integrating it, or if you encounter bugs etc.; happy to help get this used more widely.

...

...
Some of that is abstracted away by syncmonitor[0] which should be easy to adapt for most monitoring solutions and can even do real-time monitoring + alerting. It is under active development, most recently in the textual branch to expose a TUI frontend and refactor the library to track the cookies on a per-SID basis for real-time replication delay measurement.

What is not "abstracted away by syncmonitor" ?

Not sure what the question is, but the one thing the syncmonitor library doesn't give you is real-time replication delay measurement. The TUI version might eventually do some of this.

To do delay measurement you need a full history of all servers' contextCSNs over time and measure the lag from that (= when was the last time the originating server's CSN was lower than or equal to mine?). So the only thing you can get is a point-in-time replication delay measurement every time you run it (capped to e.g. 30s, at which point it gives up and declares the server behind, because monitoring solutions might otherwise declare the script itself timed out).
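
The history-based idea can be sketched as follows, under stated assumptions: a history file with one `<epoch-seconds> <csn>` line per sample, recorded on the originating server for a single serverID, oldest first. The function name `replication_lag` and the file format are hypothetical and not something syncmonitor provides.

```shell
#!/bin/bash
# replication_lag HISTORY_FILE MY_CSN NOW   (hypothetical helper)
# Finds the most recent sample at which the originating server's CSN
# was lower than or equal to MY_CSN and prints NOW minus that sample
# time in seconds, or "unknown" when no such sample exists.
replication_lag() {
    awk -v mine="$2" -v now="$3" '
        # CSNs of one serverID are fixed-width, so string comparison
        # orders them chronologically.
        $2 <= mine { last = $1; found = 1 }
        END {
            if (found) print now - last
            else       print "unknown"
        }
    ' "$1"
}
```

The result can never be finer than the sampling interval of the history, which is one reason a single run only yields a point-in-time estimate.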

Regards,

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP

openldap-technical@openldap.org