Full_Name: Toby Blake
Version: 2.3.38 amd 2.3.40
OS: Fedora Core 5 and Fedora Core 6
URL:
Submission from: (NULL) (129.215.218.33)
Hi there,
We're seeing a (seemingly) intermittent problem when using back-ldap
(with or without using the pcache overlay for caching), where 1 in x
queries fail with "result: 80 Internal (implementation specific)
error", where x is anywhere between 2 and 10. We're seeing this with
both openldap-2.3.38 and openldap-2.3.40 - running on Fedora Core 5
and 6. The remote server is running FC5, openldap 2.3.38. I
currently have a machine exhibiting this problem - it's a machine in a
student lab.
The database part of slapd.conf on this machine is:
database ldap
suffix dc=inf,dc=ed,dc=ac,dc=uk
rootdn
uid=ldaprep/inganoust.inf.ed.ac.uk,cn=inf.ed.ac.uk,cn=gssapi,cn=auth
uri ldap://testdir.inf.ed.ac.uk/
idassert-bind mode=none
bindmethod=sasl
saslmech=GSSAPI
idassert-authzFrom "dn:*"
I'm afraid I can't currently reproduce this error, but hopefully some
of the information below will help...
I've done some stepping through the code and what seems to be
happening when the error occurs is that the call to getpeername in
ldap_host_connected_to in libraries/libldap/os-ip.c:590 fails with
errno=107 "transport endpoint is not connected" and thereafter
ldap_int_sasl_open (cyrus.c:518) is called with NULL as the 'host'
argument (in our case it should be 'hadrian.inf.ed.ac.uk') - this
results in LDAP_LOCAL_ERROR being returned.
The backtrace (in ldap_host_connected_to) is:
(gdb) bt
#0 ldap_host_connected_to (sb=0xb4124950, host=0x81b0efb "localhost")
at os-ip.c:586
#1 0x081394b5 in ldap_int_sasl_bind (ld=0xb4131058, dn=0x0,
mechs=<value optimized out>, sctrls=0x0, cctrls=0x0, flags=2,
interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98)
at cyrus.c:643
#2 0x0813b45d in ldap_sasl_interactive_bind_s (ld=0xb4131058, dn=0x0,
mechs=0x88bef98 "GSSAPI", serverControls=0x0, clientControls=0x0, flags=2,
interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) at sasl.c:479
#3 0x08109625 in ldap_back_dobind_int (lcp=0xb370e0a8, op=0x8952900,
rs=0xb370f1c4, sendok=<value optimized out>, retries=0, dolock=1)
at bind.c:1997
#4 0x080dab4b in ldap_back_search (op=0x8952900, rs=0xb370f1c4)
at search.c:166
#5 0x0806317f in fe_op_search (op=0x8952900, rs=0xb370f1c4) at search.c:355
#6 0x08063b31 in do_search (op=0x8952900, rs=0xb370f1c4) at search.c:217
#7 0x08061209 in connection_operation (ctx=0xb370f238, arg_v=0x8952900)
at connection.c:1133
#8 0x081322f3 in ldap_int_thread_pool_wrapper (xpool=0x88a1d18) at tpool.c:478
#9 0x00db145b in start_thread () from /lib/libpthread.so.0
#10 0x001cf23e in clone () from /lib/libc.so.6
(gdb)
When getpeername returns the error, it is always for the same 'sd'
value - on the machine I'm looking at, it's 13, e.g.
590 if ( getpeername( sd, sa, &len ) == -1 ) {
(gdb) p sd
$3 = 13
(gdb) p errno
$4 = 107
(gdb)
What's interesting here is that if I use lsof to see the filehandles
that slapd is holding open, I see this for FD 13:
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
slapd 2551 ldap 13u sock 0,5 3603708 can't identify protocol
Interestingly, this is the exact same output I see from lsof if a
socket() has been created, but not connect()ed.
It appears then, that a socket of this type gets into the pool of
connections to the remote server being used by the local slapd, thus
explaining the 1 in x failure rate - as it cycles through the pool.
I currently have a machine in this state, so if there's more
information I could usefully provide, then let me know. I can of
course change the loglevel, but that would involve restarting slapd on
this machine - which fixes the problem. Anyway, let me know if
there's more I can do - I can experiment on other machines.
Cheers
Toby Blake
School of Informatics
University of Edinburgh