On 2/15/2022 1:57 AM, Ondřej Kuzník wrote:
> - your DB is just over the size of available RAM by itself
Yes, but that size includes not just the data itself, but all of the indexes as well, right?
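For reference, mdb_stat can break that down per subdatabase (id2entry holds the entries themselves, and each index is its own subdatabase); the data directory path here is a guess based on the Symas packaging:

# mdb_stat -ea /var/symas/openldap-data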
> - after a while using the system, other processes (and slapd) will carve out a fair amount of it that the system will be unwilling/unable to page out
Yes. But that is not currently the case. Here is a slapd process on one of our nodes that has been up about a week and a half:
ldap 1207 1 9 Feb04 ? 1-01:46:47 /opt/symas/lib/slapd -d 0 -h ldap:/// ldaps:/// ldapi:/// -u ldap -g ldap
Its resident set is a bit less than a gigabyte:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1207 ldap      20   0 8530708 954688 829836 S  28.1 47.8   1546:40 slapd
While unused (i.e. wasted) memory is only 82M, the amount of memory in use by buffer/cache that the system would be willing to give up at any time is more than a gigabyte:
              total        used        free      shared  buff/cache   available
Mem:           1949         413          82           0        1453        1382
Swap:          1023         295         728
When the problem occurs, there isn't a memory shortage. There is still free memory, nothing is getting paged in or out, and the only I/O is application reads, not system swap.
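Next time it happens I'll also capture vmstat while running the test query; if nothing is being paged, the si/so (swap in/out) columns should stay at zero even while bi (blocks read in) climbs:

# vmstat 1 10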
> - if, to answer that query, you need to crawl a large part of the DB, the OS will have to page that part into memory; at the beginning there is enough RAM to do it all just once, but later you've reached a threshold and it needs to page bits in and then drop them again to fetch others. You develop these symptoms - lots of read I/O and a delay in processing
Intuitively that does sound like a good description of the problem I'm having. But the only thing that takes a long time is returning the memberOf attribute. When queries requesting that are taking more than 30 seconds or even minutes to respond, all other queries remain instantaneous. It seems unlikely that, under memory pressure, the only queries forced to page things in and be degraded would be those? Every other query just happens to have what it needs still in memory?
> Figure out what is involved in that search and see if you can tweak
It's not a very complicated query:
# ldapsearch -x -H ldapi:/// uid=henson memberOf
[...]
dn: uid=henson,ou=user,dc=cpp,dc=edu
memberOf: uid=idm,ou=group,dc=cpp,dc=edu
memberOf: uid=iit,ou=group,dc=cpp,dc=edu
If I understand correctly, this just needs to access the index on uid to find my entry, and then the dynlist module presumably does something like this:
# ldapsearch -x -H ldapi:/// member=uid=henson,ou=user,dc=cpp,dc=edu dn
[...]
# cppnet, group, cpp.edu
dn: uid=cppnet,ou=group,dc=cpp,dc=edu

# cppnet-admin, group, cpp.edu
dn: uid=cppnet-admin,ou=group,dc=cpp,dc=edu
This just needs to access the index on member to find all of the group objects, of which in my case there are 36.
So it only needs to have two indexes and 37 objects in memory to perform quickly, right?
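Both of those attributes should have equality indexes; what's actually configured can be pulled out of cn=config over the root ldapi socket (the database RDN will vary per install):

# ldapsearch -Y EXTERNAL -H ldapi:/// -b cn=config '(olcDbIndex=*)' olcDbIndex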
When performance on memberOf queries is degraded, this takes more than 30 seconds to run. Every single time. I could run it 20 times in a row and it always takes more than 30 seconds. If it were a memory issue, you would think at least some of the queries would get lucky and find the pages they need already in memory, given they had just been accessed moments before?
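Something like this loop is what I mean; every iteration of the memberOf request stalls, while a plain cn fetch as a control stays instant:

# for i in $(seq 1 20); do time ldapsearch -x -H ldapi:/// uid=henson memberOf > /dev/null; done
# time ldapsearch -x -H ldapi:/// uid=henson cn > /dev/null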
I can certainly just throw memory at it and hope the problem goes away. But based on what I observe when it occurs, it does not feel like just a memory problem. The last time it happened I pulled the node out of the load balancer so nothing else was poking at it, and the test query was still taking more than 30 seconds.
I'm going to bump the production nodes up to 4G, which should be more than enough to run the OS and always have the entire database plus all indexes in memory. I will keep my fingers crossed this problem just goes away, but if it doesn't, what else can I do when it occurs to help track it down?
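My first thought, and the tool choice here is just a guess on my part, is to grab a syscall summary (to see how hard slapd is hitting the disk) and per-thread stacks (to see where the slow query is spending its time) while it's stuck:

# strace -c -f -e trace=read,pread64 -p $(pidof slapd) & sleep 30; kill %1
# gdb -p $(pidof slapd) -batch -ex 'thread apply all bt'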
Thanks much…