openldap 2.4.57 on 16-core OracleLinux VMs with NVMe disks. 8 nodes in an n-way multi-master configuration, MDB backend, 50k unique DNs. We see about 10,000 auths per minute per node.
Under heavy client load, the log shows many "deferring operation: binding" messages in the same second. slapd is using only 400% CPU (of a possible 1600%).
[2021-04-13 19:15:58] connection_input: conn=150474 deferring operation: binding
When I write LDIFs to one node, such as deleting a user or removing a user from a group, we see spikes in authentication latency metrics (what is normally a 0.2-0.5 second response time goes up to 15-30 seconds) across all nodes in the cluster at the same time.
What knobs can be adjusted to allow for more concurrency? It seems like writes are impacting reads.
*slapd.conf: threads* The default is 32; I tried 64 and 128 with little improvement.
*slapd.conf: syncprov* Should I increase the sessionlog size? Should I increase the checkpoint ops? How do I determine optimum values?
syncprov-checkpoint 100 5
syncprov-sessionlog 100
syncprov-reloadhint TRUE
*mdb* maxsize 17179869184 (16 GiB)
*Indices*
index objectClass eq,pres
index cn,uid,mail,mobile eq,pres,sub
index o,ou,dc,preferredLanguage eq,pres
index member,memberUid eq,pres
index uidNumber,gidNumber eq,pres
index memberOf eq
index entryUUID eq
index entryCSN eq
index uniqueMember eq
index sAMAccountName eq
*ulimit*
bash-4.2$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 482391
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

*n-way config*
serverID 1 ldap://XXXX:12389
syncrepl rid=1
  provider=ldap://XXXXXX:12389
  bindmethod=simple
  starttls=yes
  tls_cert=/opt/slapd/conf/cert.pem
  tls_cacert=/etc/pki/tls/cert.pem
  tls_key=/opt/slapd/conf/key.pem
  binddn="cn=replication_manager,dc=service-accounts,o=Root"
  credentials="YYYYYY"
  tls_reqcert=never
  searchbase=""
  schemachecking=on
  type=refreshAndPersist
  retry="60 +"
(and 7 more syncrepl stanzas like this)
mirrormode on
Any ideas? Thanks -Zetan
Zetan Drableg wrote:
openldap 2.4.57 on 16-core OracleLinux VMs with NVMe disks. 8 nodes in an n-way multi-master configuration, MDB backend, 50k unique DNs. We see about 10,000 auths per minute per node.
Under heavy client load, the log shows many "deferring operation: binding" messages in the same second. slapd is using only 400% CPU (of a possible 1600%).
Probably you could increase the number of listeners. In a pure Bind-only workload, slapd ought to be able to utilize 100% of all cores.
[2021-04-13 19:15:58] connection_input: conn=150474 deferring operation: binding
When I write LDIFs to one node, such as deleting a user or removing a user from a group, we see spikes in authentication latency metrics (what is normally a 0.2-0.5 second response time goes up to 15-30 seconds) across all nodes in the cluster at the same time.
What knobs can be adjusted to allow for more concurrency? It seems like writes are impacting reads.
You need more information, such as I/O wait % and network utilization %, to identify the cause of these latency spikes.
Nobody can suggest what to tune without knowing why the bottleneck occurs.
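A sketch of how that information could be gathered during a spike, using standard Linux tools (the host/port in the cn=monitor query are placeholders, and it only works if the monitor backend is configured):

vmstat 1             # run queue, CPU breakdown (user/sys/idle/wait), sampled every second
iostat -x 1          # per-device I/O latency, queue depth, and utilization
sar -n DEV 1         # per-interface network throughput
ldapsearch -x -H ldap://localhost:389 -b cn=monitor -s sub '(objectClass=*)' '+'   # slapd's own counters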
Under heavy client load, the log shows many "deferring operation: binding" messages in the same second. slapd is using only 400% CPU (of a possible 1600%).
Probably you could increase the number of listeners. In a pure Bind-only workload, slapd ought to be able to utilize 100% of all cores.
Do you mean the TCP port listeners on the slapd process? Do you think I'm hitting a socket accept-queue backlog limit, or something else?
slapd -h ldap://:389 ldaps://:636
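A quick way to check whether the accept queue is actually overflowing on Linux (a sketch; adjust the ports to match the listeners above):

ss -lnt '( sport = :389 or sport = :636 )'   # for LISTEN sockets, Recv-Q is the current accept-queue depth, Send-Q the backlog limit
netstat -s | grep -i listen                  # cumulative "listen queue overflowed" / "SYNs dropped" counters
sysctl net.core.somaxconn                    # kernel cap on the accept backlog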
--On Tuesday, April 13, 2021 3:15 PM -0700 Zetan Drableg zetan.drableg@gmail.com wrote:
Under heavy client load, the log shows many "deferring operation: binding" messages in the same second. slapd is using only 400% CPU (of a possible 1600%).
Probably you could increase the number of listeners. In a pure Bind-only workload, slapd ought to be able to utilize 100% of all cores.
Do you mean the TCP port listeners on the slapd process? Do you think I'm hitting a socket accept-queue backlog limit, or something else?
slapd -h ldap://:389 ldaps://:636
Read the slapd.conf(5) or slapd-config(5) man page section on listener threads.
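That is, the listener-threads directive in slapd.conf (olcListenerThreads under cn=config); a sketch of what it might look like, with an illustrative value rather than a recommendation:

# slapd.conf
threads 32            # worker threads servicing operations
listener-threads 4    # connection-manager threads; the default is 1, and the man page says to use a power of 2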
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
http://www.symas.com
When I write LDIFs to one node, such as deleting a user or removing a user from a group, we see spikes in authentication latency metrics (what is normally a 0.2-0.5 second response time goes up to 15-30 seconds) across all nodes in the cluster at the same time.
I ran mdb_copy -c to compact the LDAP databases. The size went from 2.9 GB to 140 MB, and the latency problem during inserts went away. I've noticed the LDAP data.mdb is growing by about 25 MB per day. What accounts for the growth of free pages?
Thank you
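One way to keep an eye on that growth between compactions is mdb_stat from the LMDB tools (the path here is just an example; point it at the directory holding data.mdb):

mdb_stat -ef /opt/slapd/data     # -e: environment/map info, -f: freelist (free page) status
mdb_stat -efa /opt/slapd/data    # -a: adds per-database page statistics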
--On Friday, April 16, 2021 11:45 AM -0700 Zetan Drableg zetan.drableg@gmail.com wrote:
When I write LDIFs to one node, such as deleting a user or removing a user from a group, we see spikes in authentication latency metrics (what is normally a 0.2-0.5 second response time goes up to 15-30 seconds) across all nodes in the cluster at the same time.
I ran mdb_copy -c to compact the LDAP databases. The size went from 2.9 GB to 140 MB, and the latency problem during inserts went away. I've noticed the LDAP data.mdb is growing by about 25 MB per day. What accounts for the growth of free pages?
Do you have a lot of large groups that you frequently update?
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
http://www.symas.com