My employer ships software for Linux and other Unix-like OSes that binds
to Active Directory in order to integrate those machines with AD.
Functionally, it is similar to pam_krb5 and nss_ldap.
OpenLDAP 2.4.18 is used to bind to Active Directory LDAP servers.
Authentication (to a machine trust account) is done using a Kerberos
keytab. MIT Kerberos is used.
Group membership data are stored in LDAP objects of class Group, whose
multi-valued `member' attribute holds the DNs of all members. Those DNs
refer to objects of class Group or of class User (I'm just chasing
users for now), and their `sAMAccountName' value is what I need to give
to NSS as the group member's name.
My procedure is as follows: First, I bind to one of several configured
LDAP servers using SASL2/GSSAPI, i.e. Kerberos 5. Then I query the
`member' attributes of all groups and resolve the resulting DNs one by
one to build a DN => sAMAccountName map in memory (that's about 10k
entries, so not a problem here). Then I request the actual group
entries and look up the DNs from their `member' attributes in the map.
Finally, the connection is terminated.
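
To make the procedure concrete, here is a minimal sketch in Python (used
only for illustration). The `lookup' callable is a hypothetical stand-in
for one LDAP search that returns the `sAMAccountName' for a DN, or None
if the entry is not a user. The function names are made up for this
post, and DNs are assumed to compare as exact strings.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a single LDAP search: DN in, sAMAccountName
# out, or None if the entry is not a user (e.g. a nested group).
Lookup = Callable[[str], Optional[str]]


def build_name_map(groups: Dict[str, List[str]], lookup: Lookup) -> Dict[str, str]:
    """Resolve every distinct member DN once: DN => sAMAccountName."""
    name_map: Dict[str, str] = {}
    seen = set()
    for members in groups.values():
        for dn in members:
            if dn in seen:
                continue  # already queried via another group
            seen.add(dn)
            name = lookup(dn)
            if name is not None:
                name_map[dn] = name
    return name_map


def group_member_names(groups: Dict[str, List[str]],
                       name_map: Dict[str, str]) -> Dict[str, List[str]]:
    """Translate each group's member DNs into the names handed to NSS."""
    return {
        group: [name_map[dn] for dn in members if dn in name_map]
        for group, members in groups.items()
    }
```

Members that do not resolve to a user are simply skipped here, which
matches the "users only for now" restriction above.
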
Each group member's `sAMAccountName' is queried individually, with the
base set to the DN (which I already know), the scope set to base, and
the filter set to (objectClass=*). So that's about 10k single queries in
quick succession. The whole group query typically takes about 6 seconds.
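
For illustration, the parameters of one such lookup can be written down
as follows. This is a sketch, assuming the scope is what LDAP calls
base (value 0 in OpenLDAP's ldap.h) and assuming that only
`sAMAccountName' is requested. `SearchRequest' and
`member_lookup_request' are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from typing import Tuple

LDAP_SCOPE_BASE = 0x0000  # value of LDAP_SCOPE_BASE in OpenLDAP's ldap.h


@dataclass(frozen=True)
class SearchRequest:
    """Hypothetical container for the arguments of one LDAP search."""
    base: str
    scope: int
    filter: str
    attrs: Tuple[str, ...]


def member_lookup_request(dn: str) -> SearchRequest:
    """Build the search that reads sAMAccountName from exactly one entry."""
    return SearchRequest(
        base=dn,                     # the member DN, already known
        scope=LDAP_SCOPE_BASE,       # only the entry named by the base DN
        filter="(objectClass=*)",    # matches any entry
        attrs=("sAMAccountName",),   # assumption: nothing else is requested
    )


# About 10k lookups in about 6 seconds, per client and per run:
queries_per_second = 10_000 / 6
print(f"{queries_per_second:.0f} queries/s")  # prints "1667 queries/s"
```
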
The problem is: OpenLDAP sometimes gives me LDAP_SERVER_DOWN during the
`sAMAccountName' queries. This occurs sporadically, but once it starts,
it typically affects all remaining `sAMAccountName' queries of that
run. The group entry query that follows does succeed. Most of the time,
the first of those errors immediately follows a GSSAPI error, namely
"DES key is a weak key",
which may be true but appears unrelated, since only AES128, AES256 and
RC4-HMAC keys are in the keytab.
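
The failure pattern can be checked mechanically from the per-query
result codes of one run. The following is only a sketch: the list of
codes is assumed to be collected by the caller, `failure_pattern' is a
hypothetical helper, and the two constants carry the values from
OpenLDAP's ldap.h.

```python
from typing import List, Optional, Tuple

LDAP_SUCCESS = 0
LDAP_SERVER_DOWN = -1  # value of LDAP_SERVER_DOWN in OpenLDAP's ldap.h


def failure_pattern(codes: List[int]) -> Tuple[Optional[int], bool]:
    """Return (index of the first LDAP_SERVER_DOWN, True if no later
    query in the run succeeded). Returns (None, False) if the run never
    saw LDAP_SERVER_DOWN."""
    try:
        first = codes.index(LDAP_SERVER_DOWN)
    except ValueError:
        return None, False
    rest_failed = all(code != LDAP_SUCCESS for code in codes[first:])
    return first, rest_failed
```

A run that reports (n, True) matches what I see in the field: one
failure at query n, and nothing succeeds afterwards until the group
entry query.
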
The customer's DC admins say that the DCs are not at fault. We asked
them to try to increase the server limits (such as max number of active
queries per worker thread, which defaults to 4 or something), since
about 2k client workstations hit 4 DCs every 30 minutes with the
query described above (and others). They are very reluctant to do that and we
have trouble replicating the problem in-house anyway, so overload need
not be the root cause. The DC logs reportedly show nothing unusual.
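
For a sense of scale, here is the back-of-the-envelope load estimate
behind that request. It is only a sketch: it assumes every client
issues about 10k member lookups per run, that the load is spread evenly
over time and over the DCs, and it ignores all other queries.
`per_dc_rate' is a made-up helper name.

```python
def per_dc_rate(clients: int, queries_per_run: int,
                interval_s: int, dcs: int) -> float:
    """Average queries per second that each DC sees."""
    return clients * queries_per_run / interval_s / dcs


# 2k clients, ~10k lookups each, every 30 minutes, spread over 4 DCs
average = per_dc_rate(clients=2000, queries_per_run=10_000,
                      interval_s=30 * 60, dcs=4)
print(f"{average:.0f} queries/s per DC")  # prints "2778 queries/s per DC"
```

This is an average. A single client's run is a burst of about 10k
queries within about 6 seconds, so the instantaneous load on a DC
depends on how many runs happen to overlap.
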
My software has been in use for about three months now, but the rollout
was still ongoing until recently. Customer acceptance tests did not
reveal the problem. The first incident was reported about two weeks
ago. They set up monitoring, and the alarm now goes off every couple of
minutes somewhere. Most queries still get through without problem,
though. Other queries, such as those for Netgroups, do not seem to have
any problem. They also doubled the number of DCs (to four) four weeks
ago, since the two they had were quite busy. The new DCs are similar in
hardware but are under significantly less load, which seems odd since
all four of those DCs exist just to serve our software and our software
can be shown to distribute the queries fairly. Restricting our software
to only the old or only the new DCs does not have any effect on the
failure rate.
I increased the OpenLDAP log level, but nothing enlightening turned up.
What could be the cause of the failures during the group member resolution?