Hello Everyone,
Hope you are enjoying the day.
I was curious what people are doing to detect when their MMR clusters fall out of sync.
The approach we found and have been trying uses telegraf with the [ldap_org](https://github.com/falon/CSI-telegraf-plugins/blob/master/plugins/inputs/lda...) plugin to count all objects on each LDAP master and compare the counts.
How is the community doing it? Any better way to go about this?
Any input is appreciated!
Best, Dave
On 25 Dec 2024, at 17:25, Dave davama@gmail.com wrote:
Was curious what people are doing to monitor their MMR clusters being out of sync
Have you looked at the change sequence number (CSN)? E.g., either script it and check that the values are identical.
Or use net-snmp and expose it that way. That is what I usually do.
Dw.
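A minimal sketch of the "script it" option, assuming the contextCSN lines of each provider have already been saved to a file (for example with something like `ldapsearch -LLL -H ldap://HOST -s base -b "$BASE" contextCSN`, authenticated however your setup requires). The function names `csn_values` and `csn_in_sync` and the file layout are hypothetical, not from the thread.

```shell
#!/bin/bash
# Sketch only: compare the contextCSN sets of two providers.
# Assumption: each argument is a file holding the "contextCSN: ..." lines
# returned by one provider for the suffix entry.

# Print the contextCSN values from an ldapsearch output file, sorted.
csn_values() {
    sed -n 's/^contextCSN:[[:space:]]*//p' "$1" | sort
}

# Return 0 if both files carry the same non-empty set of contextCSN values.
csn_in_sync() {
    local a b
    a=$(csn_values "$1")
    b=$(csn_values "$2")
    # An empty result is treated as "not in sync" so a failed query
    # cannot masquerade as a healthy cluster.
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Usage could look like `csn_in_sync /tmp/csn.ldap1 /tmp/csn.ldap2 || echo "contextCSN mismatch"`. On a busy cluster a single mismatch may just be replication in flight, so it is probably worth alerting only when the mismatch persists.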
Interesting! I didn't know about that feature. Thank you for the tip!
(In reply to Dirk-Willem van Gulik dirkx@webweaving.org, Dec 25, 2024 at 1:44 PM -0500.)
What about slapd-watch? Did you take a look at it?
Hi,
What we are doing is: generate on each LDAP node two files that "fingerprint" the local LDAP DIT. These are:
- the contextCSNs of the local DB
- an md5sum of a dump of (many) records in the local DB
When replication is doing its thing, the contents of these two fingerprints (files) should be identical across all nodes.
We then use zabbix-agent2 to collect and compare the fingerprints across the nodes. The values of course change often, since LDAP is in active use. But as long as the values match every 10 minutes or so, everything is working and there is nothing to worry about.
This works (for us) surprisingly well. Below are the two bash scripts. You need to define/adjust the variables for your environment, of course.
script #1: ldap-contextCSN.sh:
<begin>
#!/bin/bash

# October 2023, by MoHe

# works only as root or sudo
if [ "$EUID" -ne 0 ]
then
  echo "Please run as root"
  exit 1
fi

# LDAPSEARCH, BASE, LDAPBINDDN and ADMINPW must be defined for your environment

# remove the previous result (-f: no error if it does not exist yet)
rm -f /tmp/ldap-contextCSN

"$LDAPSEARCH" -H ldapi:// -b "$BASE" -D "$LDAPBINDDN" -w "$ADMINPW" -s base contextCSN | grep -v requesting | grep contextCSN > /tmp/ldap-contextCSN
<end>
script #2: ldap-local.md5sum.sh:
<begin>
#!/bin/bash

# June 2023, by MoHe

# works only as root or sudo
if [ "$EUID" -ne 0 ]
then
  echo "Please run as root"
  exit 1
fi

# set unique tmpdir, so multiple copies of the script can run
TMPDIR="/tmp/ldap-md5sum.tmp.$BASHPID"

# set correct slapcat, symas or 'regular'
if [ -f "/bin/slapcat" ]; then
  SLAPCAT="/bin/slapcat"
elif [ -f "/sbin/slapcat" ]; then
  SLAPCAT="/sbin/slapcat"
elif [ -f "/opt/symas/sbin/slapcat" ]; then
  SLAPCAT="/opt/symas/sbin/slapcat"
else
  echo "Sorry, cannot find slapcat..."
  exit 1
fi

# remove the previous result (-f: no error if it does not exist yet)
rm -f /tmp/ldap-local.md5sum

mkdir -p "$TMPDIR"

# keep only the dn and entryCSN of each entry, plus the blank separator lines
"$SLAPCAT" -d 0 | grep -E '^dn:|^entryCSN:|^$' > "$TMPDIR/local.dn"
# making file sortable...
awk 'BEGIN {RS=""}{gsub(/\n/,"",$0); print $0}' "$TMPDIR/local.dn" > "$TMPDIR/local.dn.sortable"
# and do the sorting...
sort "$TMPDIR/local.dn.sortable" > "$TMPDIR/local.dn.sorted"
# and create the md5sum result for zabbix to read
md5sum < "$TMPDIR/local.dn.sorted" > /tmp/ldap-local.md5sum

rm -r "$TMPDIR"
<end>
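To illustrate why the "sortable" and sort steps in script #2 matter, here is the same pipeline wrapped in a function that reads LDIF from stdin, so it can be tried on sample data without touching slapd. This is a sketch, not the author's script: the function name `fingerprint_ldif` is hypothetical, and `LC_ALL=C` is an addition on the assumption that nodes might run with different locales, which could otherwise change the sort order and therefore the hash.

```shell
#!/bin/bash
# Sketch only: fingerprint an LDIF stream by its dn/entryCSN pairs.
fingerprint_ldif() {
    # 1. keep dn, entryCSN and the blank lines separating entries
    # 2. join each entry (one blank-line-separated record) into a single line
    # 3. sort, so the order in which entries are dumped does not matter
    # 4. hash the result
    grep -E '^dn:|^entryCSN:|^$' \
        | awk 'BEGIN {RS=""} {gsub(/\n/, "", $0); print $0}' \
        | LC_ALL=C sort \
        | md5sum
}
```

With this, `slapcat -d 0 | fingerprint_ldif` should give the same hash on two nodes that hold the same entries with the same entryCSNs, even if the databases store them in a different order, and a different hash as soon as one entryCSN differs.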
We have been doing this for more than a year now, and for us it has been completely reliable.
Of course we're open to feedback, suggestions or improvements.
On Wed, Dec 25, 2024 at 04:24:49PM -0000, Dave wrote:
How is the community doing it? Any better way to go about this?
Hi,
I've seen two main approaches, sometimes combined:
1. Tracking contextCSN/cookie:
   a) read out the contextCSN from the DB's top-level entry (poll)
   b) have the server push any cookie changes on-line (push), you also get to discover the provider's serverID this way
2. Read the olmMDBEntries from cn=monitor and make sure those stay in sync
The former gives you more information and is my go-to, but its use in monitoring can be confusing: each serverID CSN has to be compared independently, you cannot do straight time arithmetic for alerting, ...
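As a rough illustration of comparing each serverID independently: a CSN has the form `timestamp#count#sid#mod`, so the third `#`-separated field identifies the serverID. The sketch below assumes the contextCSN lines of two providers were saved to two files beforehand. The function name `differing_sids` is hypothetical, and this is not how syncmonitor does it.

```shell
#!/bin/bash
# Sketch only: list the serverIDs whose contextCSN differs between two providers.
# Assumption: each argument is a file with the "contextCSN: ..." lines of one provider.
differing_sids() {
    awk -F'#' '
        FNR == 1 { n++ }                  # n = 1 for the first file, 2 for the second
        /^contextCSN:/ {
            csn = $0
            sub(/^contextCSN:[ \t]*/, "", csn)
            sid = $3                      # third #-separated field is the serverID
            if (n == 1) a[sid] = csn; else b[sid] = csn
            seen[sid] = 1
        }
        END {
            # a SID missing on one side also counts as a difference
            for (sid in seen) if (a[sid] != b[sid]) print sid
        }
    ' "$1" "$2" | sort
}
```

For example, `differing_sids /tmp/csn.ldap1 /tmp/csn.ldap2` prints nothing when every serverID matches, and one SID per line otherwise. It only reports that a SID differs, not by how much, which fits the point above that straight time arithmetic across CSNs is not meaningful for alerting.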
Some of that is abstracted away by syncmonitor[0] which should be easy to adapt for most monitoring solutions and can even do real-time monitoring + alerting. It is under active development, most recently in the textual branch to expose a TUI frontend and refactor the library to track the cookies on a per-SID basis for real-time replication delay measurement.
Both of them can be, and often are, run in tandem: entry count is useful for catching misconfiguration where ACLs do not give the replication identity the intended permissions (or only for the accesslog but not the main DB). In deltasync you also have to make sure the accesslog DB *never* runs out of space; the desyncs that result from such a failure are not recoverable save by identifying a canonical provider and using it to reseed the cluster manually.
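A possible sketch of the entry-count check, assuming the olmMDBEntries values from cn=monitor were saved per node as LDIF (for instance by searching under cn=monitor and requesting the olmMDBEntries attribute with an identity allowed to read it). The function names are hypothetical, and the comparison assumes the databases appear under the same monitor DNs on every node, which may not hold if the configurations differ.

```shell
#!/bin/bash
# Sketch only: compare olmMDBEntries counts between two nodes.
# Assumption: each argument is an LDIF file with dn and olmMDBEntries lines
# taken from cn=monitor on one node.

# Print "dn<TAB>count" for every database that reports olmMDBEntries.
mdb_entry_counts() {
    awk '
        /^dn:/            { dn = $0; sub(/^dn:[ \t]*/, "", dn) }
        /^olmMDBEntries:/ { print dn "\t" $2 }
    ' "$1" | sort
}

# Return 0 if both nodes report the same non-empty set of per-database counts.
entry_counts_match() {
    local a b
    a=$(mdb_entry_counts "$1")
    b=$(mdb_entry_counts "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Something like `entry_counts_match /tmp/monitor.ldap1 /tmp/monitor.ldap2 || echo "entry counts differ"` could then feed whatever monitoring system is in use. With deltasync the accesslog database would likely need to be excluded or handled separately, since its entry count depends on local purge timing.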
[0]. https://git.openldap.org/openldap/syncmonitor
Regards,
openldap-technical@openldap.org