Hi OpenLdap folks,
I ran into an issue with OpenLdap 2.4.44 that I am having trouble finding the root cause of.
I run Openldap in syncrepl mode. I have one machine which serves as a write endpoint (let’s call it the master node), and many machines which sync from it, and serve as read-replicas.
To ensure that they are in-sync with the Master, each read-replica runs ldapsearch against the Master node every minute. It looks at the entryCSN values for a bunch of objects on the Master, and compares them against its own entryCSNs for its copy of these objects. It searches a bunch of different objects, and in total it takes about 3 seconds for a read-replica to do this search (I have duration logging on LDAP operations enabled by merging in this patch: http://www.openldap.org/its/index.cgi/Software%20Enhancements?id=8054;page=9). About 20 MB are transferred to each read-replica each time it runs this script.
NOTE: I prefer not to use the contextCSN for this sync because I only care about certain objects of the database being in-sync, and I need to know specifically which objects are in-sync vs out-of-sync.
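For readers who want a concrete picture of the per-object comparison described above, here is a minimal sketch in Python. It is not the actual script: the function name `find_out_of_sync`, the example DNs, and the CSN values are all hypothetical, and it assumes the entryCSN values have already been fetched from the Master and from the replica into two dictionaries keyed by DN.

```python
from typing import Dict, List, Tuple


def find_out_of_sync(
    master: Dict[str, str], replica: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (dn, master_csn, replica_csn) for every entry whose entryCSN differs.

    Both arguments map a DN to its entryCSN string. An entry that is missing
    on the replica is reported with an empty replica CSN.
    """
    stale = []
    for dn, master_csn in sorted(master.items()):
        replica_csn = replica.get(dn, "")  # "" means the replica has no copy
        if replica_csn != master_csn:
            stale.append((dn, master_csn, replica_csn))
    return stale


# Hypothetical example data (DNs and CSNs are made up for illustration).
master_csns = {
    "uid=alice,ou=people,dc=example,dc=com": "20190805194800.000000Z#000000#001#000000",
    "uid=bob,ou=people,dc=example,dc=com": "20190805194805.000000Z#000000#001#000000",
}
replica_csns = {
    "uid=alice,ou=people,dc=example,dc=com": "20190805194800.000000Z#000000#001#000000",
    "uid=bob,ou=people,dc=example,dc=com": "20190805190000.000000Z#000000#001#000000",
}

for dn, m_csn, r_csn in find_out_of_sync(master_csns, replica_csns):
    print(f"out of sync: {dn} master={m_csn} replica={r_csn or '<missing>'}")
```

This reports exactly which objects differ, which is the information contextCSN alone would not give.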
I doubled the number of times this script runs per read-replica. So instead of each read-replica running this script once per minute, it was running it twice per minute.
Shortly thereafter, I started getting reports from someone who writes to the LDAP Master regularly that they were seeing a high number of write operations failing with timeouts and “Connection Refused” errors. I reduced the frequency of the script back to once per minute, and the writer reported that they were no longer seeing these errors.
I assumed that this Connection Refused error was due to the fact that Openldap 2.4 uses a single thread for incoming connections (sources: https://lwn.net/Articles/755207/, https://www.openldap.org/pub/slim/OpenLDAP_Conn_Mgmt.pdf (section 3)), and the pending connection backlog on the socket was too high. Therefore the syscall is returning Connection Refused. This may be similar to the frontend contention issue described in this post: (http://www.openldap.org/lists/openldap-devel/201308/msg00003.html).
I noticed that the values for cn=Backload,cn=Threads,cn=Monitor as well as cn=Pending,cn=Threads,cn=Monitor got very high when the read-replicas were running the script twice as much. For example, Pending is usually sitting around 5-6, but during the time of high read traffic, I saw Pending count increase by over 1000 times (my graph looks very spiky, with pending threads shooting up to 1000x, then down to 10x or 100x the next minute, then back up, etc.). I understand that cn=Backload is simply Active + Pending Threads, and interestingly Active threads stayed at normal levels. I am wondering what Pending threads means exactly, and how is Pending Threads different from Read/Write Waiters? (Interestingly, Read/Write Waiters stayed at normal levels.)
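To make the relationship between these counters easier to track over time, here is a small Python sketch that parses LDIF text for the entries under cn=Threads,cn=Monitor into a dictionary and checks Backload against Active + Pending. It assumes each entry carries its value in a `monitoredInfo` attribute; the function name `parse_thread_counters` is hypothetical, and the sample values are made up for illustration.

```python
from typing import Dict


def parse_thread_counters(ldif_text: str) -> Dict[str, int]:
    """Map the leading cn of each entry to its numeric monitoredInfo value."""
    counters: Dict[str, int] = {}
    current_cn = None
    for raw_line in ldif_text.splitlines():
        line = raw_line.strip()
        if not line:
            current_cn = None  # a blank line ends the current LDIF entry
            continue
        lowered = line.lower()
        if lowered.startswith("dn:"):
            # Take the first RDN, e.g. "cn=Pending" from "cn=Pending,cn=Threads,cn=Monitor".
            first_rdn = line[3:].strip().split(",", 1)[0]
            attr, _, value = first_rdn.partition("=")
            current_cn = value if attr.strip().lower() == "cn" else None
        elif lowered.startswith("monitoredinfo:") and current_cn:
            value = line.split(":", 1)[1].strip()
            if value.isdigit():  # skip non-numeric entries such as state strings
                counters[current_cn] = int(value)
    return counters


# Illustrative sample only; these numbers are not taken from a real server.
SAMPLE_LDIF = """\
dn: cn=Active,cn=Threads,cn=Monitor
monitoredInfo: 1

dn: cn=Pending,cn=Threads,cn=Monitor
monitoredInfo: 6

dn: cn=Backload,cn=Threads,cn=Monitor
monitoredInfo: 7
"""

counters = parse_thread_counters(SAMPLE_LDIF)
print(counters)
print("Backload == Active + Pending:",
      counters["Backload"] == counters["Active"] + counters["Pending"])
```

Sampling these values more often than once per minute may help show whether the Pending spikes line up with the moments the replicas run their script.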
I attempted to reproduce this issue by running the script concurrently from a few different clients; however, I was unable to get the Pending/Backload Threads up to similar levels (this value hovered around 16, which seems healthy; I did not see it spike to the levels described above). I observed that the latency of the Master from the read-replica’s perspective increased quite a bit during this test, but I was unable to observe Connection Refused issues.
Is my assumption about the cause of this issue (single thread for incoming connections) on the right track? Is this behavior (high Pending/Backload Threads, Connection Refused errors) a known occurrence? Are there any other metrics that I can observe which would indicate the cause of the Connection Refused errors? Is there a reliable way to repro this issue (without doubling the frequency of the read-replica script)?
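One possible way to probe the connection-backlog hypothesis in isolation is sketched below in Python. It is an assumption-laden test harness, not a known reproduction: it only opens plain TCP connections concurrently and tallies how each attempt ends, it does not speak LDAP, and the names `probe_connections` and `_try_connect` are hypothetical.

```python
import socket
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def _try_connect(host: str, port: int, timeout: float) -> str:
    """Attempt one TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "other_error"


def probe_connections(host: str, port: int, count: int, timeout: float = 2.0) -> Counter:
    """Open `count` TCP connections concurrently and tally the outcomes."""
    with ThreadPoolExecutor(max_workers=count) as pool:
        outcomes = pool.map(lambda _: _try_connect(host, port, timeout), range(count))
        return Counter(outcomes)
```

Calling something like `probe_connections("ldap-master.example.com", 389, 500)` (hostname hypothetical) while the replicas run their searches would show whether refusals appear at the TCP level, independent of any LDAP operation.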
NOTE: I have the following settings configured, which I suspect may be relevant:
olcConcurrency: 0
olcConnMaxPending: 100
olcConnMaxPendingAuth: 1000
olcGentleHUP: FALSE
olcIdleTimeout: 60
olcIndexSubstrIfMaxLen: 4
olcIndexSubstrIfMinLen: 2
olcIndexSubstrAnyLen: 4
olcIndexSubstrAnyStep: 2
olcIndexIntLen: 4
olcListenerThreads: 1
olcLocalSSF: 71
olcLogLevel: Stats
olcLogLevel: Sync
olcSizeLimit: unlimited
olcSockbufMaxIncoming: 262143
olcSockbufMaxIncomingAuth: 16777215
olcThreads: 16
olcToolThreads: 1
olcWriteTimeout: 0
Thanks,
Sent with [ProtonMail](https://protonmail.com) Secure Email.
--On Monday, August 05, 2019 7:48 PM +0000 erich_2323 erich_2323@protonmail.com wrote:
Hi OpenLdap folks,
I ran into an issue with OpenLdap 2.4.44 that I am having trouble finding the root cause of.
What database backend are you using to store your data in?
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
Thanks for the reply Quanah.
I am using MDB as the database backend.
objectClass: olcDatabaseConfig
objectClass: olcMdbConfig
olcDatabase: {2}mdb
Thanks,
--On Monday, August 05, 2019 10:08 PM +0000 erich_2323 erich_2323@protonmail.com wrote:
Thanks for the reply Quanah.
I am using MDB as the database backend.
objectClass: olcDatabaseConfig
objectClass: olcMdbConfig
olcDatabase: {2}mdb
Ok. First, I would strongly advise updating to the latest OpenLDAP release (2.4.48). Second, what is the mdb_stat -eaf output for the master's database? Third, I would take advantage of the rtxnsize setting in newer OpenLDAP releases to ensure you're not fragmenting your database with these searches. Fourth, I would not use the patch as-is in that ITS, but backport what was actually committed to OpenLDAP master, as the patch in the ITS has some negative impacts on performance that the actual committed code addressed.
Additional things you may wish to do: (a) ensure you have proper indexing for the filter you're using for the search (eq on entryCSN + whatever other components make up your search); (b) your olcConnMaxPending of 100 only allows for 100 pending operations across all clients, and it sounds like you may need to increase that (you say you have many read-only replicas, but don't provide a number).
Hope that helps!
Regards, Quanah
openldap-technical@openldap.org