On Mon, May 30, 2022 at 02:38:11PM +0100, Howard Chu wrote:
> Let us know how things go.
Arg. Seems to have been a red herring. Blew up again with swappiness set to 1, and then again with swap completely disabled :(. Usual symptoms of crazy high disk reads:
Total DISK READ :     389.05 M/s | Total DISK WRITE :       3.93 K/s
Actual DISK READ:     391.50 M/s | Actual DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 577547 be/4 ldap       36.88 M/s    0.00 B/s  0.00 % 97.92 % slapd -d 0 -h ldap:/// ~dapi:/// -u ldap -g ldap
 577546 be/4 ldap       32.27 M/s    0.00 B/s  0.00 % 97.88 % slapd -d 0 -h ldap:/// ~dapi:/// -u ldap -g ldap
 575034 be/4 ldap       29.47 M/s    0.00 B/s  0.00 % 97.72 % slapd -d 0 -h ldap:/// ~dapi:/// -u ldap -g ldap
 572838 be/4 ldap       27.38 M/s    0.00 B/s  0.00 % 97.66 % slapd -d 0 -h ldap:/// ~dapi:/// -u ldap -g ldap
 575308 be/4 ldap       24.47 M/s    0.00 B/s  0.00 % 97.50 % slapd -d 0 -h ldap:/// ~dapi:/// -u ldap -g ldap
 572866 be/4 ldap       91.55 M/s    0.00 B/s  0.00 % 97.33 % slapd -d 0 -h ldap:/// ~dapi:/// -u ldap -g ldap
 572841 be/4 ldap       26.96 M/s    0.00 B/s  0.00 % 96.87 % slapd -d 0 -h ldap:/// ~dapi:/// -u ldap -g ldap
 572836 be/4 ldap       43.90 M/s    0.00 B/s  0.00 % 96.84 % slapd -d 0 -h ldap:/// ~dapi:/// -u ldap -g ldap
 577508 be/4 ldap       76.17 M/s    0.00 B/s  0.00 % 95.96 % slapd -d 0 -h ldap:/// ~dapi:/// -u ldap -g ldap
Even though there's plenty of memory:
              total        used        free      shared  buff/cache   available
Mem:           3901         944         109           1        2847        2715
Swap:             0           0           0
Looking at the lmdb mapping:
00007f662ea25000 5242880  325560       0 rw-s- data.mdb
00007f676ec26000 2097152       0       0 rw-s- data.mdb
There seem to be fewer pages mapped in than on the one that isn't blowing up:
00007f6ab1606000 5242880  560712       0 rw-s- data.mdb
00007f6bf1807000 2097152  120772       0 rw-s- data.mdb
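To put a number on "fewer pages mapped in", here is a rough sketch that turns the two listings into resident fractions. It assumes the columns are in pmap -x order (address, Kbytes, RSS, dirty, mode, mapping), which is my reading of the output rather than something stated above; the "unhappy"/"healthy" labels and the function name are just for illustration.

```python
def resident_fraction(kbytes, rss_kb):
    """Fraction of a mapping that is currently resident in RAM."""
    return rss_kb / kbytes

# (Kbytes, RSS) pairs copied from the two data.mdb listings above.
# Assumption: second and third columns are mapping size and RSS in KB.
mappings = {
    "unhappy": [(5242880, 325560), (2097152, 0)],
    "healthy": [(5242880, 560712), (2097152, 120772)],
}

for host, maps in mappings.items():
    for kbytes, rss in maps:
        pct = 100 * resident_fraction(kbytes, rss)
        print(f"{host}: {rss} of {kbytes} KB resident ({pct:.1f}%)")
```

On these numbers the unhappy server has roughly 6% of the larger map resident versus roughly 11% on the healthy one, and nothing at all resident from the smaller map.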
Memory use is similar:
              total        used        free      shared  buff/cache   available
Mem:           3896         725         156           0        3014        2893
Swap:          2047         127        1920
The one that's unhappy is generating a lot of page faults:
ldap-02 ~ # ps -o min_flt,maj_flt 572833; sleep 10; ps -o min_flt,maj_flt 572833
   MINFL   MAJFL
11924597 3715970
   MINFL   MAJFL
11931358 3718833
ldap-02 ~ # ps -o min_flt,maj_flt 572833; sleep 10; ps -o min_flt,maj_flt 572833
   MINFL   MAJFL
11949883 3726966
   MINFL   MAJFL
11957081 3730080
Compared to the one that's working properly, which has none:
ldap-01 ~ # ps -o min_flt,maj_flt 1227; sleep 10; ps -o min_flt,maj_flt 1227
  MINFL  MAJFL
1282224 221928
  MINFL  MAJFL
1282224 221928
ldap-01 ~ # ps -o min_flt,maj_flt 1227; sleep 10; ps -o min_flt,maj_flt 1227
  MINFL  MAJFL
1282225 221928
  MINFL  MAJFL
1282225 221928
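For anyone who wants to track this over time, the ps comparison above boils down to a fault rate per second. A minimal sketch follows; the function names are made up for illustration, and parse_faults assumes the field layout documented in proc(5) for /proc/<pid>/stat (minflt is field 10, majflt is field 12). Nothing here reads /proc by itself; it only works on the samples already shown.

```python
def parse_faults(stat_line):
    """Return (minflt, majflt) from one line of /proc/<pid>/stat."""
    # The command name is in parentheses and may contain spaces,
    # so split on the last ')' before counting fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is 'state' (field 3), so field 10 is index 7
    # and field 12 is index 9.
    return int(fields[7]), int(fields[9])

def fault_rates(before, after, seconds):
    """Per-second (minor, major) fault rates between two samples."""
    return ((after[0] - before[0]) / seconds,
            (after[1] - before[1]) / seconds)

# First pair of samples from each host, taken 10 seconds apart.
unhappy = fault_rates((11924597, 3715970), (11931358, 3718833), 10)
healthy = fault_rates((1282224, 221928), (1282224, 221928), 10)

print("ldap-02 minor/s, major/s:", unhappy)
print("ldap-01 minor/s, major/s:", healthy)
```

For the first ldap-02 sample that works out to about 286 major faults per second, against zero on ldap-01 over the same interval.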
But why? Arg. All the slow queries are asking for memberOf:
May 30 21:54:25 ldap-02 slapd[572833]: conn=120576 op=1 SRCH base="ou=user,dc=cpp,dc=edu" scope=2 deref=3 filter="(&(objectClass=person)(calstateEduPersonEmplID=014994057))"
May 30 21:54:25 ldap-02 slapd[572833]: conn=120576 op=1 SRCH attr=memberOf
May 30 21:56:59 ldap-02 slapd[572833]: conn=120576 op=1 SEARCH RESULT tag=101 err=0 qtime=0.000016 etime=154.273556 nentries=1 text=
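In case it helps anyone check their own logs for the same pattern, here is a small sketch that pairs each SRCH attr= line with its SEARCH RESULT line by conn/op and reports the ones whose etime exceeds a threshold. The function name and the 1 second default threshold are arbitrary choices of mine, and the regexes are only fitted to the log lines shown above.

```python
import re

# Patterns fitted to the log lines above; other log formats may differ.
ATTR_RE = re.compile(r"conn=(\d+) op=(\d+) SRCH attr=(.*)$")
RESULT_RE = re.compile(r"conn=(\d+) op=(\d+) SEARCH RESULT .*\betime=([0-9.]+)")

def slow_searches(lines, threshold=1.0):
    """Return [((conn, op), etime, requested_attrs)] for slow searches."""
    attrs = {}   # (conn, op) -> list of requested attributes
    slow = []
    for line in lines:
        m = ATTR_RE.search(line)
        if m:
            attrs[(m.group(1), m.group(2))] = m.group(3).split()
            continue
        m = RESULT_RE.search(line)
        if m and float(m.group(3)) >= threshold:
            key = (m.group(1), m.group(2))
            slow.append((key, float(m.group(3)), attrs.get(key, [])))
    return slow

sample = [
    'May 30 21:54:25 ldap-02 slapd[572833]: conn=120576 op=1 SRCH base="ou=user,dc=cpp,dc=edu" scope=2 deref=3 filter="(&(objectClass=person)(calstateEduPersonEmplID=014994057))"',
    "May 30 21:54:25 ldap-02 slapd[572833]: conn=120576 op=1 SRCH attr=memberOf",
    "May 30 21:56:59 ldap-02 slapd[572833]: conn=120576 op=1 SEARCH RESULT tag=101 err=0 qtime=0.000016 etime=154.273556 nentries=1 text=",
]

for key, etime, requested in slow_searches(sample):
    print(key, etime, requested)
```

Counting how often memberOf shows up in the third element would confirm whether every slow search really asks for it.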
There's something going on with the dynlist overlay and memberOf queries, but I still can't figure out what <sigh>. It's not a low-memory issue; there's plenty of free memory. But for some reason the read IO goes through the roof. I'm pretty sure it has the same query load while it's freaking out as it did when it was running fine.