Hello,
We have a clustered OpenLDAP server (slapd version 2.2.23 on two Debian sarge servers). These servers have been working perfectly as a user repository for our other servers (mail servers, radius, etc.).
But now we are renewing our servers and want to upgrade to a new setup based on OpenLDAP 2.3.30 (on two Debian etch servers), and we are having problems.
In a test environment the new servers seem to work perfectly, but when we connect our mail servers to them we get a lot of errors. The mail servers seem to lose their connection to the LDAP server. On the LDAP servers the only errors we could find are a lot of:
Jul 5 09:44:33 canis4 slapd[28723]: conn=5087 fd=249 closed (connection lost)
It seems it could be a network problem, but the network is the same as for the others.
We are also wondering about a default parameter limiting connections, or something like that...
Has anybody ever had this same problem? How could we solve it?
Thanks in advance
Angel L. Mateo writes:
> Jul 5 09:44:33 canis4 slapd[28723]: conn=5087 fd=249 closed (connection lost)
OpenLDAP 2.2 just logged "closed"; OpenLDAP 2.3 log lines also say why the connection was closed.
If the client sends an Unbind request, slapd logs that. Otherwise, if the client just exits without doing ldap_unbind() first, all slapd knows is that the connection disappeared for some reason. It doesn't know why; all it can say is "connection lost". It's valid for a client to do so, but a bit naughty, because the server admin can't see whether there is a problem or not.
So - do you see lost connections on the mail server as well? If so, it's a network problem. Otherwise not. If it's a client which opens one LDAP connection, does something, and exits, it works fine (though please ask them to change it anyway). However, if the client opens a new LDAP connection when it gets an LDAP error, without closing the old LDAP connection with ldap_unbind(), you've got a file descriptor leak and a memory leak in the client.
(The ldap_unbind function is quite misnamed. It is the all-in-one send-Unbind-and-close-connection function, and should be used even after an LDAP connection was lost on the client side - which can happen e.g. if the server times out the connection.)
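For illustration, a minimal sketch of the clean lifecycle described above, using the OpenLDAP C API (the URI, base DN and filter are placeholders, not anything from this thread):

    #define LDAP_DEPRECATED 1   /* expose ldap_unbind() in newer headers */
    #include <ldap.h>
    #include <stdio.h>

    int main(void)
    {
        LDAP *ld;
        LDAPMessage *res;
        int version = LDAP_VERSION3;
        int rc;

        /* Placeholder URI, base and filter; adjust for your environment. */
        if (ldap_initialize(&ld, "ldap://ldap.example.com") != LDAP_SUCCESS)
            return 1;
        ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);

        rc = ldap_search_ext_s(ld, "dc=example,dc=com", LDAP_SCOPE_SUBTREE,
                               "(uid=someuser)", NULL, 0, NULL, NULL,
                               NULL, 0, &res);
        if (rc == LDAP_SUCCESS)
            ldap_msgfree(res);
        else
            fprintf(stderr, "search failed: %s\n", ldap_err2string(rc));

        /* Always unbind, even if the search failed or the connection was
         * lost: ldap_unbind() sends the Unbind request when it can and
         * frees the handle, so the server logs a clean close instead of
         * "connection lost". */
        ldap_unbind(ld);
        return rc == LDAP_SUCCESS ? 0 : 1;
    }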
On Thu, 2007-07-05 at 14:24 +0200, Hallvard B Furuseth wrote:
> Angel L. Mateo writes:
> > Jul 5 09:44:33 canis4 slapd[28723]: conn=5087 fd=249 closed (connection lost)
> OpenLDAP 2.2 just logged "closed"; OpenLDAP 2.3 log lines also say why the connection was closed.
> If the client sends an Unbind request, slapd logs that. Otherwise, if the client just exits without doing ldap_unbind() first, all slapd knows is that the connection disappeared for some reason. It doesn't know why; all it can say is "connection lost". It's valid for a client to do so, but a bit naughty, because the server admin can't see whether there is a problem or not.
> So - do you see lost connections on the mail server as well? If so, it's a network problem. Otherwise not. If it's a client which opens one LDAP connection, does something, and exits, it works fine (though please ask them to change it anyway). However, if the client opens a new LDAP connection when it gets an LDAP error, without closing the old LDAP connection with ldap_unbind(), you've got a file descriptor leak and a memory leak in the client.
I find it difficult to believe it is a client problem, because we are having the problem with at least three different client programs (postfix 2.3.8, courier-imap 3.0.8 and freeradius 1.1.3). The errors result in users not being found by the clients, and I know there was no such problem with our previous LDAP version, because we never got "user not found" errors in the clients.
Angel L. Mateo writes:
> I find it difficult to believe it is a client problem, because we are having the problem with at least three different client programs (postfix 2.3.8, courier-imap 3.0.8 and freeradius 1.1.3). The errors result in users not being found by the clients, and I know there was no such problem with our previous LDAP version, because we never got "user not found" errors in the clients.
Could be quite unrelated. For example, did you just point the new LDAP server at the data written by the old server? I don't know if the database formats are compatible. It might help to slapcat the old database to an LDIF file (preferably using the old OpenLDAP version), move away the database files, and slapadd them back with the new OpenLDAP version.
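As a sketch of that export/reimport procedure (the paths are placeholders, and slapd should be stopped while the database files are moved):

    # On the old (2.2) server, dump the database to LDIF:
    slapcat -l backup.ldif

    # On the new (2.3) server, move the old database files aside
    # (the directory is whatever slapd.conf's "directory" points at):
    mv /var/lib/ldap /var/lib/ldap.old && mkdir /var/lib/ldap

    # Rebuild the database from the LDIF using the new tools:
    slapadd -l backup.ldif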
On Fri, 2007-07-06 at 16:50 +0200, Hallvard B Furuseth wrote:
> Could be quite unrelated. For example, did you just point the new LDAP server at the data written by the old server? I don't know if the database formats are compatible. It might help to slapcat the old database to an LDIF file (preferably using the old OpenLDAP version), move away the database files, and slapadd them back with the new OpenLDAP version.
I loaded the LDAP data with a slapadd from an LDIF generated with slapcat.
"Angel L. Mateo" amateo@um.es writes:
> On Thu, 2007-07-05 at 14:24 +0200, Hallvard B Furuseth wrote:
> > Angel L. Mateo writes:
> > > Jul 5 09:44:33 canis4 slapd[28723]: conn=5087 fd=249 closed (connection lost)
> > OpenLDAP 2.2 just logged "closed"; OpenLDAP 2.3 log lines also say why the connection was closed.
> > If the client sends an Unbind request, slapd logs that. Otherwise, if the client just exits without doing ldap_unbind() first, all slapd knows is that the connection disappeared for some reason. It doesn't know why; all it can say is "connection lost". It's valid for a client to do so, but a bit naughty, because the server admin can't see whether there is a problem or not.
> > So - do you see lost connections on the mail server as well? If so, it's a network problem. Otherwise not. If it's a client which opens one LDAP connection, does something, and exits, it works fine (though please ask them to change it anyway). However, if the client opens a new LDAP connection when it gets an LDAP error, without closing the old LDAP connection with ldap_unbind(), you've got a file descriptor leak and a memory leak in the client.
> I find it difficult to believe it is a client problem, because we are having the problem with at least three different client programs (postfix 2.3.8, courier-imap 3.0.8 and freeradius 1.1.3). The errors result in users not being found by the clients, and I know there was no such problem with our previous LDAP version, because we never got "user not found" errors in the clients.
As you can see from the logs, there is no bind operation, though there has to be one. Are the clients configured to use LDAPv2?
-Dieter
Dieter Kluenter wrote:
> As you can see from the logs, there is no bind operation, though there has to be one. Are the clients configured to use LDAPv2?
With LDAPv3 (in contrast to LDAPv2) you don't need a bind operation for anonymous LDAP access. The application MAY immediately send a search request.
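In terms of the C API, a minimal sketch of what such a client may legally do (the URI and base DN are placeholders):

    LDAP *ld;
    LDAPMessage *res;
    int version = LDAP_VERSION3;

    ldap_initialize(&ld, "ldap://ldap.example.com");
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);

    /* No bind call at all: under LDAPv3 the search is simply
     * processed as an anonymous operation. */
    ldap_search_ext_s(ld, "dc=example,dc=com", LDAP_SCOPE_SUBTREE,
                      "(objectClass=*)", NULL, 0, NULL, NULL,
                      NULL, 0, &res);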
Ciao, Michael.
"Angel L. Mateo" amateo@um.es writes:
> Hello,
> We have a clustered OpenLDAP server (slapd version 2.2.23 on two Debian sarge servers). These servers have been working perfectly as a user repository for our other servers (mail servers, radius, etc.).
> But now we are renewing our servers and want to upgrade to a new setup based on OpenLDAP 2.3.30 (on two Debian etch servers), and we are having problems.
> In a test environment the new servers seem to work perfectly, but when we connect our mail servers to them we get a lot of errors. The mail servers seem to lose their connection to the LDAP server. On the LDAP servers the only errors we could find are a lot of:
> Jul 5 09:44:33 canis4 slapd[28723]: conn=5087 fd=249 closed (connection lost)
I have seen this with a malformed perl script, which was not properly binding and unbinding to the directory but only closing the connection:
slapd[6955]: conn=2 fd=14 ACCEPT from PATH=/usr/local/var/run/ldapi (PATH=/usr/local/var/run/ldapi)
slapd[6955]: conn=2 fd=14 closed (connection lost)
So double check the configuration of your mail servers and raise the log level.
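For reference, one way to raise the log level is in slapd.conf; the values are additive bit flags (256 is "stats", 8 is connection management), and the exact mix here is just a suggestion:

    # slapd.conf: log connection setup/teardown plus operation stats
    loglevel 264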
-Dieter
Hello,
These are the logs recorded with an increased log level:
Jul 6 09:01:21 canis4 slapd[10389]: daemon: read activity on 14
Jul 6 09:01:21 canis4 slapd[10389]: daemon: read activity on 31
Jul 6 09:01:21 canis4 slapd[10389]: daemon: read activity on 44
Jul 6 09:01:21 canis4 slapd[10389]: connection_read(44): input error=-2 id=84, closing.
Jul 6 09:01:21 canis4 slapd[10389]: daemon: removing 44
Jul 6 09:01:21 canis4 slapd[10389]: conn=84 fd=44 closed (connection lost)
On Jul 6, 2007, at 02:09, Angel L. Mateo wrote:
> These are the logs recorded with an increased log level:
> Jul 6 09:01:21 canis4 slapd[10389]: daemon: read activity on 14
> Jul 6 09:01:21 canis4 slapd[10389]: daemon: read activity on 31
> Jul 6 09:01:21 canis4 slapd[10389]: daemon: read activity on 44
> Jul 6 09:01:21 canis4 slapd[10389]: connection_read(44): input error=-2 id=84, closing.
> Jul 6 09:01:21 canis4 slapd[10389]: daemon: removing 44
> Jul 6 09:01:21 canis4 slapd[10389]: conn=84 fd=44 closed (connection lost)
Can you turn on some more verbose logging on the client side, to see if it thinks there was a bad response, or that it sent a query that went unanswered, etc.?
If not, it sounds like it's time to get even more basic and look lower down in the stack. Here's some of the network-level debugging it sounds like you might have to start doing:
You should also have a connection log line like:
Jul 6 09:01:00 canis4 slapd[10389]: conn=84 fd=44 ACCEPT from IP=$client_ip:$port (IP=0.0.0.0:389)
Start up a tcpdump on both the client and the server and watch for the problem to recur. Compare the tcpdump output files, looking for the newly logged source $port, to see whether the client and the server are seeing different packets, or whether the client is doing something strange (like not actually sending output, or closing the connection).
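For example (the interface name is an assumption; adjust to your setup), on each machine:

    # capture LDAP traffic to a file for later comparison
    tcpdump -i eth0 -s 0 -w ldap-capture.pcap port 389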
Your original message also mentioned some other changes:
> We have a clustered OpenLDAP server (slapd version 2.2.23 [...snip...]
> But now we are renewing our servers and want to upgrade to a new setup based on OpenLDAP 2.3.30 (on two Debian etch servers), and we are having problems.
I'm not sure what your "clustered" architecture is: are you using a load balancer or some other layer-3 directing front-end? If so, is it exactly the same front-end in both the production and test set-ups? Regardless, you would need to add those tcpdumps on both sides of the load-directing box as well, to make sure it's not behaving differently in some way.
-philip
On Fri, 2007-07-06 at 10:00 -0500, Philip Kizer wrote:
> I'm not sure what your "clustered" architecture is: are you using a load balancer or some other layer-3 directing front-end? If so, is it exactly the same front-end in both the production and test set-ups? Regardless, you would need to add those tcpdumps on both sides of the load-directing box as well, to make sure it's not behaving differently in some way.
We are using a load balancer that distributes requests to two different LDAP servers. The balancing algorithm hashes the source, that is, all requests from the same source are directed to the same server. The configuration is the same in the production and test environments.
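As an aside, a source-hash configuration of this kind would look roughly like the following under Linux LVS - an assumed example, since the thread never names the actual balancer. "-s sh" selects the source-hashing scheduler, so each client IP sticks to one real server:

    # hypothetical VIP and real-server addresses
    ipvsadm -A -t 10.0.0.10:389 -s sh
    ipvsadm -a -t 10.0.0.10:389 -r 10.0.0.11:389 -m
    ipvsadm -a -t 10.0.0.10:389 -r 10.0.0.12:389 -m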
Hello,
After some tests with OpenLDAP, we have looked at the only difference between the production servers and the new ones: the new ones are installed as VMs on a Xen virtualized server. So we have installed a new server (a real server, without Xen) and it seems to cope with the production load. So we now think it could be a network performance problem related to Xen (although we have other heavily loaded servers running under Xen without any apparent problem).
On 7/9/07, Angel L. Mateo <amateo@um.es> wrote:
> Hello,
> After some tests with OpenLDAP, we have looked at the only difference between the production servers and the new ones: the new ones are installed as VMs on a Xen virtualized server. So we have installed a new server (a real server, without Xen) and it seems to cope with the production load. So we now think it could be a network performance problem related to Xen (although we have other heavily loaded servers running under Xen without any apparent problem).
Actually, I'm not sure if this xen stuff was ever resolved: http://www.openldap.org/lists/openldap-software/200603/msg00214.html
I'd be interested in an update on any other Xen experiences people have had. (You didn't mention whether you were using Linux, NetBSD, or some other Xen-supported arch.)
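One commonly suggested check for flaky networking under Xen in that era - an assumption on my part, not something verified in this thread - was to disable TX checksum offload on the guest's virtual interface:

    # assumed workaround: turn off TX checksum offload on the vif
    ethtool -K eth0 tx off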
Hello,
Could it be a problem related to the architecture? I had the servers installed in a Xen environment on amd64 hosts (Debian etch amd64). I have moved the servers to the same scheme, that is, running as VMs on Xen servers, but on Intel processors (Debian etch i386), and the problems have gone away.
It is difficult for me to understand such a low-level dependency in OpenLDAP or BDB, but this has solved our problems.