Hello, I'm working on the same project as Meheni.
Thanks for your answer; we'll try OpenLDAP 2.6 with the lastbind-precision setting.
However, we have several questions about the version we're currently using.
Is this a known problem and referenced somewhere? (we haven't found it)
Is it normal to find no replication error logs even in stats + sync mode?
We ran some tests in sequential mode (300,000 accounts one after the other) and managed to reproduce the problem.
-Denis
On Wed, Oct 11, 2023 at 2:11 PM, Quanah Gibson-Mount quanah@fast-mail.org wrote:
--On Tuesday, October 10, 2023 9:30 PM +0200 Ziani Meheni mehani06@gmail.com wrote:
Hello, we are working on a project and we've come across a problem with replication after performance testing:
You need to use OpenLDAP 2.6 and then set the:
lastbind-precision
value. I use 5 minutes.
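A minimal cn=config sketch of that setting, assuming the 2.6 attribute is named olcLastBindPrecision and sits on the frontend database (check slapd-config(5) for your release; the slapd.conf directive is the lastbind-precision named above):

# Hedged sketch: attribute name and placement are assumptions to verify.
# 300 seconds = the 5 minutes suggested above.
dn: olcDatabase={-1}frontend,cn=config
changetype: modify
add: olcLastBindPrecision
olcLastBindPrecision: 300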
--Quanah
--On Monday, October 16, 2023 9:33 AM +0200 Falgon falgon.comp@gmail.com wrote:
Hello, I'm working on the same project as Meheni.
Thanks for your answer; we'll try OpenLDAP 2.6 with the lastbind-precision setting.
However, we have several questions about the version we're currently using.
Is this a known problem and referenced somewhere? (we haven't found it)
Generally, with something like lastbind, you'll run into collisions of the timestamp, which will cause a lot of havoc with replication. It is not the only case where this can occur. I highly advise reading the caveats in the admin guide about MPR replication.
Is it normal to find no replication error logs even in stats + sync mode?
No. You'd have to provide more information.
--Quanah
Hello, sorry for the delay. Thanks for the answers.
Generally, with something like lastbind, you'll run into collisions of the timestamp, which will cause a lot of havoc with replication. It is not the only case where this can occur. I highly advise reading the caveats in the admin guide about MPR replication.
Yes, that's what we thought at first, but with the various tests we've carried out, we're doubtful that collisions are the cause. When testing with a single account that BINDs more than 500 times per second, we can't reproduce the problem. The same applies to 10 accounts looping at 500 BINDs/s.
No. You'd have to provide more information.
We searched all our logs in stats + sync mode and found no replication-specific error messages. What exact messages should we expect to find in the case of a collision problem?
--On Tuesday, November 7, 2023 12:56 PM +0000 falgon.comp@gmail.com wrote:
Hello, sorry for the delay. Thanks for the answers.
Generally, with something like lastbind, you'll run into collisions of the timestamp, which will cause a lot of havoc with replication. It is not the only case where this can occur. I highly advise reading the caveats in the admin guide about MPR replication.
Yes, that's what we thought at first, but with the various tests we've carried out, we're doubtful that collisions are the cause. When testing with a single account that BINDs more than 500 times per second, we can't reproduce the problem. The same applies to 10 accounts looping at 500 BINDs/s.
So I'm looking at your configuration and have some questions:
a) olcPasswordCryptSaltFormat: $6$rounds=10000$%.16s -> Why are you using crypt passwords? OpenLDAP ships with multiple secure modules for password hashing, such as argon2. I'd advise using that; note that crypt is non-portable. (An LDIF sketch follows this message.)
b) olcLogLevel: stats sync
This generally should be:
olcLogLevel: stats
olcLogLevel: sync
c) olcPasswordHash: {CRYPT} -> See (a)
d) I'd suggest not using a root password at all for cn=config, and use EXTERNAL auth over ldapi. If you are going to use one, upgrade to argon2
e) Why do you have separate credentions for the monitor db?
f) Delete this index: olcDbIndex: pwdLastSuccess eq,pres
g) olcSpReloadHint: TRUE -> This setting should *not* be on the main DB, delete it from dn: olcOverlay={0}syncprov,olcDatabase={2}mdb,cn=config
h) For your benchmark test, this is probably not frequent enough, as the purge will never actually remove anything, since you're only purging data more than 1 day old: olcAccessLogPurge: 01+00:00 00+04:00
i) For the accesslog DB, are you sure olcDbMaxSize: 2147483648 is a large enough size, or are you hitting the 2GB limit?
Also, it appears you're running this test with two slapds on the same server? That's an incredibly bad idea, since the I/O of the two processes writing to disk will conflict massively.
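A minimal LDIF sketch of points (a)/(c) and (g), under some assumptions: the argon2 module name and the cn=module{0} / frontend DNs follow common packaging and should be checked against your build, while the syncprov DN is the one quoted in (g).

# (a)/(c): load the argon2 hashing module and prefer {ARGON2} for new passwords
# (assumption: the module entry is cn=module{0} and the module is named argon2;
#  some builds package it as pw-argon2)
dn: cn=module{0},cn=config
changetype: modify
add: olcModuleLoad
olcModuleLoad: argon2

dn: olcDatabase={-1}frontend,cn=config
changetype: modify
replace: olcPasswordHash
olcPasswordHash: {ARGON2}

# (g): remove olcSpReloadHint from the main DB's syncprov overlay (DN as given above)
dn: olcOverlay={0}syncprov,olcDatabase={2}mdb,cn=config
changetype: modify
delete: olcSpReloadHint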
Hello, thank you for the answer and for taking the time to read Meheni's config files. I can answer all of your questions:
a) + c) Why are you using crypt passwords? - We're using crypt because we're migrating from an old solution to OpenLDAP, and crypt is the most secure option that remains compatible for us.
b) olcLogLevel: stats sync - We run our tests with stats only. Meheni probably left this setting in from a check before sending the config here.
d) I'd suggest not using a root password at all for cn=config - Thank you for this suggestion, we will probably try it.
e) Why do you have separate credentions for the monitor db? - Sorry, I don't understand the word "credentions". Do you mean credentials?
f) Delete this index: olcDbIndex: pwdLastSuccess eq,pres - This index is used in some filters, and we have also tried another architecture with 1 provider, multiple consumers and a referalForward. But yes, this is a good idea; we will run our tests without this index. With 300+ BINDs/s this index is constantly recalculated. Thanks.
g) olcSpReloadHint: TRUE -> This setting should *not* be on the main DB - Thanks, yes, we have it on both DBs; we will delete it.
h) For your benchmark test, this is probably not frequent enough, as the purge will never run - We've run endurance tests that include purging. That setting is a month old, and we have changed it multiple times while testing different setups. To include the purge during our tests, we actually set it to 00+01:00 00+00:03. In the final configuration we will probably set it to 03+00:00 00+00:03. We found that purging every 3 minutes reduced the impact on performance. (An LDIF example follows this list.)
i) + last question: For the accesslog DB, are you sure this is a large enough size? Also, it appears you're running this test with two slapds on the same server?
- That comes from Meheni's configuration: we cleaned up our configuration files before sharing them here for privacy reasons (he tried to reproduce the problem on his virtual machine and did reproduce it). We run our tests on 4 servers with only one slapd per server, and the accesslog olcDbMaxSize on each server is actually 64424509440.
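As a small sketch of the purge change mentioned in (h), assuming a hypothetical accesslog overlay DN (adjust the {1}accesslog / {2}mdb indices to your actual tree):

# Purge accesslog entries older than 1 hour, checking every 3 minutes
# (the benchmark setting mentioned above); the DN below is an assumption.
dn: olcOverlay={1}accesslog,olcDatabase={2}mdb,cn=config
changetype: modify
replace: olcAccessLogPurge
olcAccessLogPurge: 00+01:00 00+00:03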
Further information: we tested provider/consumer mode before MPR, but it didn't meet our needs. We get better performance with the current configuration (all that remains is to find a solution to the replication problem).
I'll repost a previous question here too: what are the exact messages or error messages we should find in the case of a collision problem?
Thanks again for your time and your help.
--On Thursday, November 23, 2023 5:33 PM +0000 falgon.comp@gmail.com wrote:
b) olcLogLevel: stats sync
- We run our tests with stats only. Meheni probably left this setting in from a check before sending the config here.
My point was more that it should be a multi-valued attribute with distinct values, not a single-valued attribute with 2 strings in the value.
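In LDIF terms that means two separate values of olcLogLevel; a minimal example:

dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: stats
olcLogLevel: sync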
e) Why do you have separate credentions for the monitor db?
- Sorry, I don't understand the word "credentions". Do you mean credentials?
Yes, typo
h) For your benchmark test, this is probably not frequent enough, as the purge will never run - We've run endurance tests that include purging. That setting is a month old, and we have changed it multiple times while testing different setups. To include the purge during our tests, we actually set it to 00+01:00 00+00:03. In the final configuration we will probably set it to 03+00:00 00+00:03. We found that purging every 3 minutes reduced the impact on performance.
While it's correct that frequent purging is better, you missed my overall point, which is that when you're running a benchmark you likely want to purge data on a shorter timescale.
I'll repost a previous question here too: what are the exact messages or error messages we should find in the case of a collision problem?
*If* there are collisions, you'll see the server falling back to REFRESH mode. But only if you have "sync" logging enabled.
syncrepl.c: Debug( LDAP_DEBUG_SYNC, "do_syncrep2: %s delta-sync lost sync on (%s), switching to REFRESH\n",
syncrepl.c: Debug( LDAP_DEBUG_SYNC, "do_syncrep2: %s delta-sync lost sync, switching to REFRESH\n",
--Quanah
Hello, thanks for the answer.
Quanah Gibson-Mount wrote:
--On Thursday, November 23, 2023 5:33 PM +0000 falgon.comp(a)gmail.com wrote:
b) olcLogLevel: stats sync
- We run our tests with stats only. Meheni probably left this setting in from a check before sending the config here.
My point was more that it should be a multi-valued attribute with distinct values, not a single-valued attribute with 2 strings in the value.
Good to know, thanks.
h) For your benchmark test, this is probably not frequent enough, as the purge will never run - We've run endurance tests that include purging. That setting is a month old, and we have changed it multiple times while testing different setups. To include the purge during our tests, we actually set it to 00+01:00 00+00:03. In the final configuration we will probably set it to 03+00:00 00+00:03. We found that purging every 3 minutes reduced the impact on performance.
While it's correct that frequent purging is better, you missed my overall point, which is that when you're running a benchmark you likely want to purge data on a shorter timescale.
Yes, this is what we did.
I'll repost a previous question here too: what are the exact messages or error messages we should find in the case of a collision problem?
*If* there are collisions, you'll see the server falling back to REFRESH mode. But only if you have "sync" logging enabled.
syncrepl.c: Debug( LDAP_DEBUG_SYNC, "do_syncrep2: %s delta-sync lost sync on (%s), switching to REFRESH\n",
syncrepl.c: Debug( LDAP_DEBUG_SYNC, "do_syncrep2: %s delta-sync lost sync, switching to REFRESH\n",
--Quanah
We've done a lot of tests to reproduce the synchronization problem with the sync logs enabled, and our OpenLDAP servers don't switch to REFRESH mode. As we said, our OpenLDAP instances become very slow (as if they were freezing) during replication and can no longer replicate correctly, resulting in a constant accumulation of delay once the problem occurs. That's really strange. We've done a lot of testing, and we have the following scenarios that we don't understand:
- 200,000 users in random mode -> problem occurs
- 200,000 users in sequential mode -> problem occurs
- 1 user in random mode -> no problem
- 10 users in random mode -> no problem
We still have no explanation for these results.
Again, thanks for your time and your help.
Hello,
I'm writing again in this conversation to provide more information and bring the topic back up.
Since the last message we have implemented the lastbind overlay with a lastbindprecision setting of 1800, which has greatly reduced the replication problem (but has not solved it completely). Over the past month, we've set up and configured other OpenLDAP directories with the same configuration for a new service. The new directories show the same replication problem. The difference is that these new directories receive far more modifications via MOD operations. The lastbind overlay is therefore no longer a solution to this problem.
Have you had any reports of replication problems similar to ours over the past year? Have you been able to investigate our case? We need help finding a solution or workaround for our new directories.
If you need any further information, please don't hesitate to ask me questions or read the previous posts in this conversation, including Ziani Meheni's initial message.
Thanks in advance
On Tue, Oct 15, 2024 at 10:09:40AM +0200, Falgon wrote:
Hello,
I'm writing again in this conversation to provide more information and bring the topic back up.
Since the last message we have implemented the lastbind overlay with a lastbindprecision setting of 1800, which has greatly reduced the replication problem (but has not solved it completely). Over the past month, we've set up and configured other OpenLDAP directories with the same configuration for a new service. The new directories show the same replication problem. The difference is that these new directories receive far more modifications via MOD operations. The lastbind overlay is therefore no longer a solution to this problem.
What is your *slowest* server's maximum sustained write rate? If the *global* write rate is higher than that, that server is guaranteed to fall behind.
Improving that usually boils down to removing contention from the DBs, in order:
- each DB on a dedicated disk, separate from the one used for logging, either a raw block device or at least a filesystem with journalling switched *off*
- using the internal file logger if logs are your bottleneck
- giving the system enough RAM so that searches don't have to hit the disk at all
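For the internal file logger, a minimal sketch, assuming the 2.6 attributes olcLogFile and olcLogFileOnly (verify names and values against slapd-config(5) for your release):

# Hedged sketch: write slapd's own log to a file and bypass syslog.
dn: cn=config
changetype: modify
replace: olcLogFile
olcLogFile: /var/log/slapd/slapd.log
-
replace: olcLogFileOnly
olcLogFileOnly: TRUE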
Regards,
openldap-technical@openldap.org