https://bugs.openldap.org/show_bug.cgi?id=10274
Issue ID: 10274
Summary: Replication issue on MMR configuration
Product: OpenLDAP
Version: 2.5.14
Hardware: All
OS: Linux
Status: UNCONFIRMED
Keywords: needs_review
Severity: normal
Priority: ---
Component: slapd
Assignee: bugs@openldap.org
Reporter: falgon.comp@gmail.com
Target Milestone: ---
Created attachment 1036 --> https://bugs.openldap.org/attachment.cgi?id=1036&action=edit
In this attachment you will find 2 OpenLDAP configurations for 2 instances, a SLAMD configuration example, 5 screenshots showing the issue, and one text file explaining what the screenshots show.
Hello, we are opening this issue as a follow-up to our initial post on openldap-technical: https://lists.openldap.org/hyperkitty/list/openldap-technical@openldap.org/t...
Issue: We are working on a project and have come across a replication issue after performance testing:
*Configuration :*
RHEL 8.6
OpenLDAP 2.5.14
*MMR-delta* configuration on multiple servers (attached)
300,000 users configured and used for tests
*olcLastBind: TRUE*
Use of SLAMD (load testing)
*Problem description:*
We are currently running performance and resilience tests on our infrastructure using the SLAMD tool configured to perform BINDs and MODs on a defined range of accounts.
We use a load balancer (VIP) to spread requests equally across all of our servers (it is also possible to run the performance tests directly against each directory).
With our current infrastructure we are able to perform approximately 300 MOD/BIND operations per second. Beyond that, replication delays start to build up and we randomly hit the issue described here.
When we run performance tests that exceed our write capacity, replication between servers can randomly end up in a state where the directories are unable to catch up with their replication delay.
The directories still update their contextCSNs, but extremely slowly (as if frozen). From then on, it is impossible for the directories to catch up again, even with no incoming traffic.
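To put a number on "extremely slowly", the contextCSN values read from the suffix entry of each server can be compared per server ID. The following is only a sketch, not the tooling used in our tests: it assumes the values have already been collected as strings (for example with a base-scope ldapsearch on the suffix requesting contextCSN) and that they follow the usual timestamp#count#SID#mod layout. The function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_csn(csn):
    """Split a CSN such as 20240101120000.123456Z#000000#001#000000
    into (UTC timestamp, change count, server ID)."""
    stamp, count, sid, _mod = csn.strip().split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ")
    when = when.replace(tzinfo=timezone.utc)
    # count and SID are written in hexadecimal
    return when, int(count, 16), int(sid, 16)


def newest_per_sid(values):
    """Map each server ID to its newest (timestamp, count) pair."""
    newest = {}
    for value in values:
        when, count, sid = parse_csn(value)
        if sid not in newest or (when, count) > newest[sid]:
            newest[sid] = (when, count)
    return newest


def lag_seconds(reference, candidate):
    """For every SID known to the reference server, return how many
    seconds the candidate server is behind (None if the SID is missing)."""
    ref = newest_per_sid(reference)
    cand = newest_per_sid(candidate)
    lag = {}
    for sid, (ref_when, _count) in ref.items():
        if sid not in cand:
            lag[sid] = None
        else:
            delta = (ref_when - cand[sid][0]).total_seconds()
            lag[sid] = max(0.0, delta)
    return lag
```

Sampling this at a fixed interval after the load has stopped would show whether the lag shrinks at all or stays flat, which is the behaviour we are trying to describe.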
A restart of the instance is required to perform a full refresh and solve the incident.
We have enabled synchronization logs and see no error or refresh messages indicating a problem (we can provide the logs if necessary).
We suspect a write collision or a replication conflict, but nothing of the sort is ever recorded in our sync logs.
We've run a lot of tests.
For example, when we run a performance test on a single live server, we don't reproduce the problem.
Another example: if we define a different account range for each server with SLAMD, we don't reproduce the problem either.
If we use only one account for the range, we don't reproduce the problem either.
______________________________________________________________________
I have added some screenshots to the attachment to show the issue, along with all the explanations.
______________________________________________________________________
*Symptoms :*
One or more directories can no longer replicate normally after performance testing ends.
No apparent error logs.
A restart of the instances is needed to solve the problem.
*How to reproduce the problem:*
Have at least two servers in MMR mode.
Set LastBind to TRUE.
Perform a SLAMD run through a load balancer in bandwidth mode, OR start multiple SLAMD tests at the same time, one per server, with the same account range.
Exceed the maximum write capacity of the servers.
*SLAMD configuration :*
authrate.sh --hostname ${HOSTNAME} --port ${PORTSSL} \
    --useSSL --trustStorePath ${CACERTJKS} \
    --trustStorePassword ${CACERTJKSPW} --bindDN "${BINDDN}" \
    --bindPassword ${BINDPW} --baseDN "${BASEDN}" \
    --filter "(uid=[${RANGE}])" --credentials ${USERPW} \
    --warmUpIntervals ${WARMUP} \
    --numThreads ${NTHREADS} ${ARGS}
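For the reproduction variant without a load balancer (several tests started at the same time, one per server, with the same account range), a small wrapper along these lines could be used. This is a sketch under assumptions, not the script we actually ran: it reuses only the authrate.sh options shown in this report, reads the same variable names from the environment, and the LDAP_HOSTS variable (a comma-separated host list) is hypothetical.

```python
import os
import shlex
import subprocess


def build_authrate_cmd(hostname, env):
    """Build the authrate.sh argument list for one server, using the
    same options and variable names as the command in this report."""
    cmd = [
        "authrate.sh",
        "--hostname", hostname,
        "--port", env["PORTSSL"],
        "--useSSL",
        "--trustStorePath", env["CACERTJKS"],
        "--trustStorePassword", env["CACERTJKSPW"],
        "--bindDN", env["BINDDN"],
        "--bindPassword", env["BINDPW"],
        "--baseDN", env["BASEDN"],
        # same account range for every server, as in the failing scenario
        "--filter", "(uid=[%s])" % env["RANGE"],
        "--credentials", env["USERPW"],
        "--warmUpIntervals", env["WARMUP"],
        "--numThreads", env["NTHREADS"],
    ]
    # optional extra arguments, split the way a shell would split them
    cmd.extend(shlex.split(env.get("ARGS", "")))
    return cmd


def run_parallel(hostnames, env):
    """Start one authrate.sh per server at the same time and wait for all."""
    procs = [subprocess.Popen(build_authrate_cmd(h, env)) for h in hostnames]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # LDAP_HOSTS is a hypothetical variable, e.g. "ldap1,ldap2"
    hosts = os.environ.get("LDAP_HOSTS", "")
    if hosts:
        run_parallel(hosts.split(","), os.environ)
```

Giving each server its own RANGE instead of a shared one corresponds to the case in which we do not reproduce the problem.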