Full_Name: Toby Blake Version: 2.3.38 amd 2.3.40 OS: Fedora Core 5 and Fedora Core 6 URL: Submission from: (NULL) (129.215.218.33)
Hi there,
We're seeing a (seemingly) intermittent problem when using back-ldap (with or without using the pcache overlay for caching), where 1 in x queries fail with "result: 80 Internal (implementation specific) error", where x is anywhere between 2 and 10. We're seeing this with both openldap-2.3.38 and openldap-2.3.40 - running on Fedora Core 5 and 6. The remote server is running FC5, openldap 2.3.38. I currently have a machine exhibiting this problem - it's a machine in a student lab.
The database part of slapd.conf on this machine is:
database ldap suffix dc=inf,dc=ed,dc=ac,dc=uk rootdn uid=ldaprep/inganoust.inf.ed.ac.uk,cn=inf.ed.ac.uk,cn=gssapi,cn=auth uri ldap://testdir.inf.ed.ac.uk/ idassert-bind mode=none bindmethod=sasl saslmech=GSSAPI idassert-authzFrom "dn:*"
I'm afraid I can't currently reproduce this error, but hopefully some of the information below will help...
I've done some stepping through the code and what seems to be happening when the error occurs is that the call to getpeername in ldap_host_connected_to in libraries/libldap/os-ip.c:590 fails with errno=107 "transport endpoint is not connected" and thereafter ldap_int_sasl_open (cyrus.c:518) is called with NULL as the 'host' argument (in our case it should be 'hadrian.inf.ed.ac.uk') - this results in LDAP_LOCAL_ERROR being returned.
The backtrace (in ldap_host_connected_to) is:
(gdb) bt #0 ldap_host_connected_to (sb=0xb4124950, host=0x81b0efb "localhost") at os-ip.c:586 #1 0x081394b5 in ldap_int_sasl_bind (ld=0xb4131058, dn=0x0, mechs=<value optimized out>, sctrls=0x0, cctrls=0x0, flags=2, interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) at cyrus.c:643 #2 0x0813b45d in ldap_sasl_interactive_bind_s (ld=0xb4131058, dn=0x0, mechs=0x88bef98 "GSSAPI", serverControls=0x0, clientControls=0x0, flags=2, interact=0x81293b0 <lutil_sasl_interact>, defaults=0x892df98) at sasl.c:479 #3 0x08109625 in ldap_back_dobind_int (lcp=0xb370e0a8, op=0x8952900, rs=0xb370f1c4, sendok=<value optimized out>, retries=0, dolock=1) at bind.c:1997 #4 0x080dab4b in ldap_back_search (op=0x8952900, rs=0xb370f1c4) at search.c:166 #5 0x0806317f in fe_op_search (op=0x8952900, rs=0xb370f1c4) at search.c:355 #6 0x08063b31 in do_search (op=0x8952900, rs=0xb370f1c4) at search.c:217 #7 0x08061209 in connection_operation (ctx=0xb370f238, arg_v=0x8952900) at connection.c:1133 #8 0x081322f3 in ldap_int_thread_pool_wrapper (xpool=0x88a1d18) at tpool.c:478 #9 0x00db145b in start_thread () from /lib/libpthread.so.0 #10 0x001cf23e in clone () from /lib/libc.so.6 (gdb)
When getpeername returns the error, it is always for the same 'sd' value - on the machine I'm looking at, it's 13, e.g.
590 if ( getpeername( sd, sa, &len ) == -1 ) { (gdb) p sd $3 = 13 (gdb) p errno $4 = 107 (gdb)
What's interesting here is that if I use lsof to see the filehandles that slapd is holding open, I see this for FD 13:
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME slapd 2551 ldap 13u sock 0,5 3603708 can't identify protocol
Interestingly, this is the exact same output I see from lsof if a socket() has been created, but not connect()ed.
It appears then, that a socket of this type gets into the pool of connections to the remote server being used by the local slapd, thus explaining the 1 in x failure rate - as it cycles through the pool.
I currently have a machine in this state, so if there's more information I could usefully provide, then let me know. I can of course change the loglevel, but that would involve restarting slapd on this machine - which fixes the problem. Anyway, let me know if there's more I can do - I can experiment on other machines.
Cheers Toby Blake School of Informatics University of Edinburgh