On 2/15/2022 1:57 AM, Ondřej Kuzník wrote:
> - your DB is just over the size of available RAM by itself
Yes, but that size includes not just the data itself, but all of the
indexes as well, right?
> - after a while using the system, other processes (and slapd) will
>   carve out a fair amount of it that the system will be
>   unwilling/unable to page out
Yes. But that is not currently the case. Here is a slapd process on one
of our nodes that has been up about a week and a half:
ldap 1207 1 9 Feb04 ? 1-01:46:47
/opt/symas/lib/slapd -d 0 -h ldap:/// ldaps:/// ldapi:/// -u ldap -g ldap
Its resident set is a bit less than a gigabyte:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
 1207 ldap      20   0 8530708 954688 829836 S  28.1 47.8  1546:40 slapd
While unused (i.e. wasted) memory is only 82M, the amount of memory in
use by buffer/cache that the system would be willing to give up at any
time is more than a gigabyte:
              total        used        free      shared  buff/cache   available
Mem:           1949         413          82           0        1453        1382
Swap:          1023         295         728
When the problem occurs, there isn't a memory shortage; there is still
free memory. Nothing is getting paged in or out, and the only I/O is
application reads, not swap.
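(For reference, this is roughly how I'm checking that; it assumes the
sysstat package for pidstat, and 1207 is the slapd PID shown above:)

# Watch for swap activity while the slow query runs; the si/so
# columns stay at 0 the whole time:
vmstat 1 5

# Per-process disk I/O for slapd, confirming the reads are
# application reads rather than swap traffic:
pidstat -d -p 1207 1 5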
> - if, to answer that query, you need to crawl a large part of the DB,
>   the OS will have to page that part into memory. At the beginning
>   there is enough RAM to do it all just once; later, you've reached a
>   threshold and it needs to page bits in and then drop them again to
>   fetch others, and you develop these symptoms - lots of read I/O and
>   a delay in processing
Intuitively that does sound like a good description of the problem I'm
having. But the only thing that takes a long time is returning the
memberOf attribute. When queries requesting it are taking more than 30
seconds or even minutes to respond, all other queries remain
instantaneous. It seems unlikely that, under memory pressure, the only
queries forced to fault pages in and be degraded would be those? Every
other query just happens to have what it needs still in memory?
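One way I can quantify that is to compare per-operation elapsed times
in the logs. A rough sketch, assuming "loglevel stats" is enabled and
syslog writes to /var/log/slapd.log (the path is a guess and will vary):

# RESULT lines carry etime= (elapsed time per operation) under
# "loglevel stats"; pull out the slowest operations:
grep -o 'etime=[0-9.]*' /var/log/slapd.log | sort -t= -k2 -n | tail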
> Figure out what is involved in that search and see if you can tweak
It's not a very complicated query:
# ldapsearch -x -H ldapi:/// uid=henson memberOf
[...]
dn: uid=henson,ou=user,dc=cpp,dc=edu
memberOf: uid=idm,ou=group,dc=cpp,dc=edu
memberOf: uid=iit,ou=group,dc=cpp,dc=edu
If I understand correctly, this just needs to access the index on uid to
find my entry, and then the dynlist module presumably does something
like this:
# ldapsearch -x -H ldapi:/// member=uid=henson,ou=user,dc=cpp,dc=edu dn
[...]
# cppnet, group, cpp.edu
dn: uid=cppnet,ou=group,dc=cpp,dc=edu
# cppnet-admin, group, cpp.edu
dn: uid=cppnet-admin,ou=group,dc=cpp,dc=edu
This just needs to access the index on member to find all of the group
objects, of which there are 36 in my case.
So it only needs to have two indexes and 37 objects in memory to perform
quickly, right?
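(Both attributes are indexed here. For reference, this is roughly how
I check; it assumes cn=config with an MDB backend, and the {1}mdb RDN
is a guess:)

# List the configured indexes; the eq indexes on uid and member are
# what the two lookups above rely on:
ldapsearch -Y EXTERNAL -H ldapi:/// -LLL \
    -b 'olcDatabase={1}mdb,cn=config' olcDbIndex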
When performance on memberOf queries is degraded, this takes more than
30 seconds to run. Every single time. I can run it 20 times in a row
and it always takes more than 30 seconds. If it were a memory issue,
you would think that at least some of the queries would get lucky and
find the pages they need already in memory, given they had just been
accessed moments before?
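(For concreteness, this is the kind of loop I mean; every single
iteration exceeds 30 seconds:)

# Run the test query 20 times in a row and time each one; if it were
# just cold pages, at least some repeats should come back fast:
for i in $(seq 1 20); do
    time ldapsearch -x -H ldapi:/// uid=henson memberOf > /dev/null
done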
I can certainly just throw memory at it and hope the problem goes away.
But based on what I observe when it occurs, it does not feel like just
a memory problem. The last time it happened I pulled the node out of
the load balancer so nothing else was poking at it, and the test query
was still taking more than 30 seconds.
I'm going to bump the production nodes up to 4G, which should be more
than enough to run the OS and always have the entire database plus all
indexes in memory. I will keep my fingers crossed this problem just goes
away, but if it doesn't, what else can I do when it occurs to help track
it down?
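For example, would it be useful to capture something like the
following the next time it happens? (These are just the tools that
come to mind; strace and gdb may need to be installed:)

# Per-thread CPU usage, to see whether one slapd thread is spinning:
top -H -p "$(pidof slapd)"

# Syscall counts across all threads for ~30 seconds; lots of pread()s
# would point at the DB being crawled from disk:
timeout 30 strace -f -c -p "$(pidof slapd)"

# Stack snapshot of all threads while the query is stuck:
gdb -p "$(pidof slapd)" -batch -ex 'thread apply all bt'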
Thanks much…