Hello, we are working on a project and have run into a replication problem after performance testing:

 

Configuration:

RHEL 8.6

OpenLDAP 2.5.14

MMR-delta configuration on multiple servers attached

300,000 users configured and used for tests

olcLastBind: TRUE

SLAMD used for load generation (performance tests)

 

Problem description:

We are currently running performance and resilience tests on our infrastructure using the SLAMD tool configured to perform BINDs on a defined range of accounts.

We use a load balancer (VIP) to distribute requests evenly across all of our servers (it is also possible to run performance tests directly against each directory).

With our current infrastructure and LastBind enabled, we are able to sustain 300 BINDs/s without any replication delay. Beyond that rate, replication delays start to appear.

However, when we run performance tests that exceed our write capacity, replication between the servers can intermittently break down: some directories fall behind and are unable to catch up on their replication delay.

The directories still update their contextCSNs, but extremely slowly (as if frozen). From then on, the directories are unable to catch up, even with no incoming traffic.

A restart of the instance is required to perform a full refresh and solve the incident.
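
The contextCSN lag described above can be measured per server ID with a small script. This is only a minimal sketch, not the tooling used in these tests: the host names and the suffix in the usage comment are placeholders, the function names are hypothetical, and it assumes GNU date plus read access to the contextCSN attribute on the suffix entry.

```shell
#!/usr/bin/env bash
# Sketch: compare contextCSN values of two servers and print the lag per SID.
# A CSN looks like: 20230815120000.123456Z#000000#001#000000
#                   (timestamp # change count # server ID # modification number)

# Convert the timestamp part of a CSN to epoch seconds (requires GNU date).
csn_to_epoch() {
    local ts=${1%%.*}   # keep only "YYYYmmddHHMMSS"
    date -u -d "${ts:0:4}-${ts:4:2}-${ts:6:2} ${ts:8:2}:${ts:10:2}:${ts:12:2}" +%s
}

# Print the server ID (third '#'-separated field) of a CSN.
csn_sid() {
    local ts count sid mod
    IFS='#' read -r ts count sid mod <<< "$1"
    printf '%s\n' "$sid"
}

# Print how many seconds CSN $2 is behind CSN $1.
csn_lag_seconds() {
    echo $(( $(csn_to_epoch "$1") - $(csn_to_epoch "$2") ))
}

# Read all contextCSN values of a suffix from one server.
# $1 = LDAP URI, $2 = suffix DN (both placeholders chosen by the caller).
read_context_csns() {
    ldapsearch -x -LLL -H "$1" -s base -b "$2" contextCSN |
        awk '/^contextCSN:/ {print $2}'
}

# Compare two newline-separated CSN lists and print the lag for each shared SID.
# $1 = CSNs of the reference server, $2 = CSNs of the server being checked.
compare_csns() {
    local ref other
    while read -r ref; do
        [ -n "$ref" ] || continue
        while read -r other; do
            if [ -n "$other" ] && [ "$(csn_sid "$ref")" = "$(csn_sid "$other")" ]; then
                printf 'sid=%s lag=%ss\n' "$(csn_sid "$ref")" \
                    "$(csn_lag_seconds "$ref" "$other")"
            fi
        done <<< "$2"
    done <<< "$1"
}

# Usage (placeholder hosts and suffix):
#   compare_csns "$(read_context_csns ldap://ldap1.example.com dc=example,dc=com)" \
#                "$(read_context_csns ldap://ldap2.example.com dc=example,dc=com)"
```

Running the comparison repeatedly after the load has stopped shows whether the lag for a given SID is shrinking or has stalled.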

 

We have enabled synchronization logging, and the logs show no errors or refresh events that would indicate a problem (we can provide the logs if necessary).
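
For anyone trying to reproduce this, one way to enable synchronization logging on a cn=config deployment is sketched below. The ldapi:/// endpoint, the EXTERNAL bind, and the "stats sync" level combination are assumptions for illustration and not necessarily the settings used on these servers; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print an LDIF that sets the slapd log level to "stats sync"
# in cn=config. Note that "replace" overwrites any existing olcLogLevel values.
sync_loglevel_ldif() {
    cat <<'EOF'
dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: stats sync
EOF
}

# Usage (as root on each server, assuming ldapi:/// with EXTERNAL auth is enabled):
#   sync_loglevel_ldif | ldapmodify -Q -Y EXTERNAL -H ldapi:///
```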


We suspect a write collision or a replication conflict.

 

We've already run several tests.

For example, when we run a performance test on a single live server, we don't reproduce the problem.

Another example: if we define a different account range for each server with SLAMD, we don't reproduce the problem either.

If we use only one account for the range, we don't reproduce the problem either.

 

Symptoms:

One or more directories no longer replicate normally after the performance test ends.

No apparent error logs.

The instances must be restarted to resolve the problem.

 

How to reproduce the problem:

Have at least two servers in MMR mode

Set LastBind to TRUE

Run a SLAMD test through the load balancer in bandwidth mode, OR start multiple SLAMD tests at the same time, one per server, all using the same account range.

Exceed the maximum write capacity of the servers.
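
The "one test per server, same account range" variant of these steps could be scripted roughly as follows. This is a sketch under assumptions: the variable names mirror the SLAMD configuration in this post, while the launch_all function, the DRY_RUN switch, and the host names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: start one authrate run per server in parallel, every run using the
# same ${RANGE}, so that the same accounts are bound on all servers at once.
# Set DRY_RUN=1 to print the commands instead of executing them.
launch_all() {
    local host
    for host in "$@"; do
        # ${ARGS} is left unquoted on purpose so it can carry several options.
        ${DRY_RUN:+echo} authrate.sh --hostname "$host" --port "${PORTSSL}" \
            --useSSL --trustStorePath "${CACERTJKS}" \
            --trustStorePassword "${CACERTJKSPW}" --bindDN "${BINDDN}" \
            --bindPassword "${BINDPW}" --baseDN "${BASEDN}" \
            --filter "(uid=[${RANGE}])" --credentials "${USERPW}" \
            --warmUpIntervals "${WARMUP}" \
            --numThreads "${NTHREADS}" ${ARGS} &
    done
    wait   # block until every background run has finished
}

# Usage (placeholder host names):
#   launch_all ldap1.example.com ldap2.example.com
```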

 

SLAMD configuration:

authrate.sh --hostname "${HOSTNAME}" --port "${PORTSSL}" \
            --useSSL --trustStorePath "${CACERTJKS}" \
            --trustStorePassword "${CACERTJKSPW}" --bindDN "${BINDDN}" \
            --bindPassword "${BINDPW}" --baseDN "${BASEDN}" \
            --filter "(uid=[${RANGE}])" --credentials "${USERPW}" \
            --warmUpIntervals "${WARMUP}" \
            --numThreads "${NTHREADS}" ${ARGS}