intermittent memberOf performance issues - openldap-technical

4 Feb 2022


I've run into another problem with the memberOf implementation on my 2.5 
servers. After I sorted out the proper configuration, queries requesting 
memberOf were very performant:
Feb  4 13:26:44 ldap-01 slapd[1207]: conn=23393 op=1 SRCH 
base="ou=user,dc=cpp,dc=edu" scope=2 deref=3 
filter="(&(objectClass=person)(calstateEduPersonEmplID=013522522))"
Feb  4 13:26:44 ldap-01 slapd[1207]: conn=23393 op=1 SRCH attr=memberOf
Feb  4 13:26:44 ldap-01 slapd[1207]: conn=23393 op=1 SEARCH RESULT 
tag=101 err=0 qtime=0.000015 etime=0.191860 nentries=1 text=
However, intermittently the server gets into a state where the exact 
same query takes over 30 seconds:
Feb  4 08:05:11 ldap-01 slapd[1425456]: conn=40797 op=1 SRCH 
base="ou=user,dc=cpp,dc=edu" scope=2 deref=3 
filter="(&(objectClass=person)(calstateEduPersonEmplID=015559557))"
Feb  4 08:05:11 ldap-01 slapd[1425456]: conn=40797 op=1 SRCH attr=memberOf
Feb  4 08:05:50 ldap-01 slapd[1425456]: conn=40797 op=1 SEARCH RESULT 
tag=101 err=0 qtime=0.000019 etime=39.435523 nentries=1 text=
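
For anyone wanting to check their own logs for the same pattern, here is a minimal sketch that pulls the etime out of SEARCH RESULT lines and lists the slow ones. It assumes each log record is on a single line in the actual syslog file (the excerpts above are wrapped by the mail client) and that the format matches what is shown here. The function name and the 1-second threshold are illustrative choices, not anything slapd defines.

```python
import re

# Matches the SEARCH RESULT record format shown in the log excerpts above.
RESULT_RE = re.compile(
    r"conn=(?P<conn>\d+) op=(?P<op>\d+) SEARCH RESULT .*?etime=(?P<etime>\d+\.\d+)"
)

def slow_searches(lines, threshold=1.0):
    """Return (conn, op, etime) for each search taking at least `threshold` seconds.

    `threshold` is an arbitrary illustrative cutoff, not a slapd setting.
    """
    slow = []
    for line in lines:
        m = RESULT_RE.search(line)
        if m is None:
            continue  # not a SEARCH RESULT record
        etime = float(m.group("etime"))
        if etime >= threshold:
            slow.append((int(m.group("conn")), int(m.group("op")), etime))
    return slow
```

Passing it an open log file (for example `slow_searches(open(path))`, with `path` pointing at wherever slapd logs on your system) returns the connection and operation numbers, which can then be matched back to the corresponding SRCH lines.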
When this occurs, the only way I have found to resolve the issue is 
to reboot the server. Simply restarting slapd results in the same 
degraded performance on these queries.
Normally there is very low read I/O load on the servers during 
operation, probably averaging less than 1M/s, peaking up to maybe 
20-30M/s for just an instant occasionally. When the memberOf query 
performance is degraded, there is a very high read I/O load on the 
server, continuously about 200-300M/s.
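
As a rough way to put numbers on the read load described above, the following sketch samples the "sectors read" counter from /proc/diskstats twice and converts the difference to MB/s. It is Linux-only. The device name "sda" and the 5-second interval are assumptions for illustration, so substitute whichever block device holds the database.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors regardless of device

def sectors_read(diskstats_text, device):
    """Return the cumulative 'sectors read' counter for one block device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Fields: major, minor, name, reads completed, reads merged, sectors read, ...
        if len(fields) >= 6 and fields[2] == device:
            return int(fields[5])
    raise ValueError(f"device {device!r} not found in diskstats")

def read_rate_mb_s(before, after, seconds):
    """Convert two sector counters taken `seconds` apart into MB/s (10^6 bytes)."""
    return (after - before) * SECTOR_BYTES / seconds / 1e6

def sample(device="sda", interval=5.0):
    """Measure the read rate of `device` over `interval` seconds.

    The device name and interval defaults are illustrative assumptions.
    """
    with open("/proc/diskstats") as f:
        before = sectors_read(f.read(), device)
    time.sleep(interval)
    with open("/proc/diskstats") as f:
        after = sectors_read(f.read(), device)
    return read_rate_mb_s(before, after, interval)
```

Calling `sample()` periodically and logging the result alongside the slapd etime values would show whether the slow searches line up with the sustained read load.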
Any thoughts on this? It seems that, for some reason, the server gets into 
a state where it is not using the cache or memory map for the search 
required to construct the memberOf results, and is instead doing 
a full disk read of the entire database?
It's also weird that restarting the service does not resolve this, but 
rebooting the server does. I'm not intimately familiar with the 
internals of lmdb; is there some state that persists with the 
environment or memory map in between service runs that is only cleared 
by a reboot?
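
One crude first check on the caching question is to compare the kernel's page cache size with the size of the database file. The sketch below parses /proc/meminfo for that purpose. It is only a sanity check under stated assumptions: the `Cached` figure covers every cached file on the system, not just the database, so it gives at best an upper bound on how much of the database could be resident. The path /var/lib/ldap/data.mdb is a hypothetical example, and the function names are made up for illustration.

```python
import os

def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: value} (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info

def cache_vs_db(meminfo_text, db_bytes):
    """Return (page cache bytes, database bytes, cache/database ratio)."""
    cached = parse_meminfo(meminfo_text)["Cached"] * 1024  # Cached is reported in kB
    return cached, db_bytes, cached / db_bytes

def check(db_path="/var/lib/ldap/data.mdb"):
    """Compare the live page cache with the database file (hypothetical path)."""
    with open("/proc/meminfo") as f:
        return cache_vs_db(f.read(), os.path.getsize(db_path))
```

A ratio well below 1 would mean the page cache cannot currently be holding the whole database. A ratio above 1 says nothing conclusive, since the cache may be filled with other files.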
I initially thought I had a theory on it, relating to an 
unrelated bug in RHEL 8.5 that broke the "needs-restarting" command, 
resulting in servers not properly rebooting after kernel/library 
updates. The most recent occurrence of this issue started after such 
an update without the required reboot, but on reviewing historic 
occurrences, it has also happened at times that don't meet that criterion, 
so I find myself clueless again as to what's going on.
Any advice on how to fix or further debug this issue would be much 
appreciated, thanks…