My employer ships software for Linux and other Unix-like OSes that binds to Active Directory in order to, basically, integrate the host with AD. Functionally, it is similar to pam_krb5 and nss_ldap. OpenLDAP 2.4.18 is used to bind to the Active Directory LDAP servers. Authentication (to a machine trust account) is done with a Kerberos keytab, using MIT Kerberos.
Group membership data are stored in LDAP objects of class Group. Their `member' attribute is multi-valued and holds the DNs of all members. Those DNs refer to objects of class Group or of class User (I'm just chasing users for now), and their `sAMAccountName' value is what I need to give to NSS as the group member's name.
My procedure is as follows: First, I bind to one of several configured LDAP servers using SASL2/GSSAPI, i.e. Kerberos 5. Then I query the `member' attributes of all groups and resolve the resulting DNs one by one to build a DN => sAMAccountName map in memory (that's about 10k entries, so not a problem here). Then I request the actual group entries and look up the DNs from their `member' attributes in the map. Last, the connection is terminated.
Each group member's `sAMAccountName' is queried individually, with the search base set to the DN (which I already know), the scope set to base, and the filter set to (objectClass=*). So that's about 10k single queries in quick succession. The whole group query typically takes about 6 seconds.
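To make the procedure concrete, here is a minimal Python sketch of the two passes. It is not the actual implementation: `search_base' is a hypothetical stand-in for one base-scoped LDAP search with the filter (objectClass=*), and DIRECTORY is made-up sample data standing in for the DCs.

```python
from typing import Callable, Dict, List, Optional

Entry = Dict[str, List[str]]
Search = Callable[[str, List[str]], Optional[Entry]]

# Hypothetical sample data standing in for the directory: DN -> attributes.
DIRECTORY: Dict[str, Entry] = {
    "CN=Staff,OU=Groups,DC=example,DC=com": {
        "objectClass": ["group"],
        "sAMAccountName": ["staff"],
        "member": [
            "CN=Alice,OU=Users,DC=example,DC=com",
            "CN=Bob,OU=Users,DC=example,DC=com",
            "CN=Admins,OU=Groups,DC=example,DC=com",  # nested group
        ],
    },
    "CN=Admins,OU=Groups,DC=example,DC=com": {
        "objectClass": ["group"],
        "sAMAccountName": ["admins"],
        "member": ["CN=Alice,OU=Users,DC=example,DC=com"],
    },
    "CN=Alice,OU=Users,DC=example,DC=com": {
        "objectClass": ["user"],
        "sAMAccountName": ["alice"],
    },
    "CN=Bob,OU=Users,DC=example,DC=com": {
        "objectClass": ["user"],
        "sAMAccountName": ["bob"],
    },
}


def search_base(dn: str, attrs: List[str]) -> Optional[Entry]:
    """Stand-in for one base-scoped search on `dn` with (objectClass=*)."""
    entry = DIRECTORY.get(dn)
    if entry is None:
        return None
    return {a: entry[a] for a in attrs if a in entry}


def build_dn_map(group_dns: List[str], search: Search) -> Dict[str, str]:
    """Pass 1: resolve every member DN once, one query per DN."""
    dn_map: Dict[str, str] = {}
    for group_dn in group_dns:
        group = search(group_dn, ["member"]) or {}
        for member_dn in group.get("member", []):
            if member_dn in dn_map:
                continue  # already resolved via another group
            entry = search(member_dn, ["objectClass", "sAMAccountName"])
            # Only users are chased for now; nested groups are skipped.
            if entry and "user" in entry.get("objectClass", []):
                dn_map[member_dn] = entry["sAMAccountName"][0]
    return dn_map


def group_members(group_dn: str, dn_map: Dict[str, str], search: Search) -> List[str]:
    """Pass 2: fetch the group entry and translate member DNs via the map."""
    group = search(group_dn, ["member"]) or {}
    return [dn_map[dn] for dn in group.get("member", []) if dn in dn_map]
```

In this sketch, pass 1 corresponds to the roughly 10k single queries that sometimes fail, and pass 2 to the group entry query that follows.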
The problem is: OpenLDAP sometimes gives me LDAP_SERVER_DOWN during the `sAMAccountName' queries. This occurs sporadically, but once it does, it typically persists for the rest of the `sAMAccountName' queries. The group entry query that follows does succeed. Most of the time the first of those errors immediately follows a GSSAPI error, namely "DES key is a weak key", which may be true but appears unrelated, since only AES128, AES256 and HMAC are used in the keytab.
The customer's DC admins say that the DCs are not at fault. We asked them to try increasing the server limits (such as the maximum number of active queries per worker thread, which defaults to 4 or something), since about 2k client workstations hit 4 DCs every 30 minutes with the mentioned query (and others). They are very reluctant to do that, and we have trouble replicating the problem in-house anyway, so overload need not be the root cause. The DC logs reportedly show nothing unusual.
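For a rough sense of scale, the figures above can be combined into a back-of-the-envelope estimate. This is only a sketch under assumptions that are not measured: every client resolves about the same 10k members per run, the runs are spread evenly over the 30-minute interval rather than synchronized, and queries are distributed evenly across the DCs.

```python
# Figures taken from the description above (all approximate).
CLIENTS = 2000             # about 2k client workstations
QUERIES_PER_RUN = 10_000   # about 10k single queries per group resolution
RUN_SECONDS = 6            # the whole group query takes about 6 seconds
INTERVAL_SECONDS = 30 * 60 # each client repeats the query every 30 minutes
DCS = 4                    # number of domain controllers


def burst_rate_per_client() -> float:
    """Queries per second one client issues while its run is in progress."""
    return QUERIES_PER_RUN / RUN_SECONDS


def average_rate_per_dc() -> float:
    """Average queries per second per DC, assuming evenly spread runs."""
    total_per_interval = CLIENTS * QUERIES_PER_RUN
    return total_per_interval / INTERVAL_SECONDS / DCS


def average_concurrent_clients() -> float:
    """Average number of clients mid-run at any moment (unsynchronized runs)."""
    return CLIENTS * RUN_SECONDS / INTERVAL_SECONDS


if __name__ == "__main__":
    print(f"burst per client:   {burst_rate_per_client():.0f} queries/s")
    print(f"average per DC:     {average_rate_per_dc():.0f} queries/s")
    print(f"concurrent clients: {average_concurrent_clients():.1f}")
```

If the clients' 30-minute timers happen to line up, the peak would be far above these averages, which this estimate does not capture.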
My software has been in use for about three months now, but the rollout was still ongoing until recently. Customer acceptance tests did not report the problem. The first incident was reported about two weeks ago. They set up monitoring, and the bell now rings every couple of minutes somewhere. Most queries still get through without problems, though. Other queries, such as those for Netgroups, do not seem to have any problem. They also doubled the number of DCs (to four) four weeks ago, since the two they had were quite busy. The new DCs are similar in hardware but are under significantly less load, which seems odd, since all four of those DCs exist just to serve our software and our software can be shown to distribute the queries fairly. Restricting our software to only the old or only the new DCs does not have any effect on the failure rate.
I increased the OpenLDAP log level, but nothing enlightening turned up.
What could be the cause of the failures during the group member resolution?