Full_Name: Howard Chu
OS: Solaris 10
Submission from: (NULL) (126.96.36.199)
Submitted by: hyc
test050 hung on me after some number of iterations. Unfortunately I didn't save
the stack traces, but basically there was one thread waiting in send_ldap_ber()
on the write2 cv, and another thread in config_back_add() waiting for a pool
pause to succeed. netstat showed that no connections had queued data, so there
should have been no reason for the writer to still be waiting.
I believe what happened here is that while the writer was waiting (it was a
syncprov qtask replaying events for a psearch) the psearch connection got
closed. Solaris is using select, and select() doesn't specially distinguish
socket close events - they're reported as read events. The deadlock is because
we queue read events into the thread pool, and we don't discover they're
actually closed sockets until the read thread gets to run and tries to read from
the socket (and gets zero bytes back). But since the pool is entering a pause,
the reader thread cannot run, so it can't detect the hangup and dispose of the connection.
The ideal fix for this is to process hangup events inline in the listener thread
instead of pushing them into the thread pool. But that requires being able to
cheaply determine that a hangup actually occurred, and select() doesn't give us that information.
We could get this info using poll() instead. Since nowadays any POSIX platform
that implements select() also implements poll(), we can probably just switch to
poll() and drop select(). One exception is Windows: Winsock only supports poll()
on Windows Vista and newer.
(Note, we had a patch that added a connection_hangup() handler for Linux epoll()
at one point, but I dropped it later because it seemed to have strange
interactions with Samba. Should look into resurrecting it again.)
I don't think we can really fix this issue without knowing for certain when
hangup events occur. If we're forced to keep using select, that implies that the
main listener thread must attempt a read on the socket before deciding how to
dispatch the connection. Any thoughts?
Full_Name: Rodrigo Luiz Vargas Costa
OS: CentOS release 5.2 (Final)
Submission from: (NULL) (188.8.131.52)
I have been exchanging some information on the openldap lists, where it looks
like some improvements are being made to replication for release 2.4.18.
The architecture I'm running has 2 machines in MirrorMode on the same subnet
(on the same switch). These systems are part of an HA system sharing a VIP,
where both machines run slapd simultaneously (bound to any local interface) and
only the VIP is moved for HA purposes.
The issue I'm facing, from a general user's view, shows up when I stop the
secondary Provider2 (master 2) for backup purposes using slapcat.
Provider1 (master 1) continues to provide LDAP service, so some entries can
be created while the backup is running (no consumer from Provider 2).
Even when only a small number of entries differ, once the consumer on Provider 2
reconnects to Provider 1, syncrepl enters the full DB search as expected.
For context, I have some memory limitations, so I need to limit
dncachesize to around 80% of the DB's entry count.
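For reference, such a cap is set with the back-bdb/hdb dncachesize directive in slapd.conf; a sketch with hypothetical numbers (a DB of roughly 4.8M entries capped at ~80%):

```
# hypothetical figures, not this reporter's actual config
database        bdb
suffix          "dc=example,dc=com"
directory       /var/lib/ldap
# DB holds ~4,800,000 entries; cap the DN cache at ~80% of that
dncachesize     3900000
```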
From a user perspective I see that after the cache is filled, the system enters
a state where synchronization no longer happens. For full reference (config,
gdb, etc.), please see the file attached via FTP.
Then I see 2 issues:
1) The consumer from Provider2 never finishes syncrepl and nothing is
replicated, even after days with only a small number of test differences and no
other traffic (Provider 1 stays continuously at 100% CPU);
2) Even if I stop Provider2 (and thus its consumer), I see no change in
Provider 1's activity. The CPU stays at 100% even after days, which suggests
a hang in a thread or in the logic.
I compiled OpenLDAP with GDB symbols and then took several thread traces during
state 2 reported above. It looks like it loops forever, stuck on some thread
lock.
I could also note that in this situation the monitor cache changes a single
entry, at a very slow pace. Being more specific:
dn: cn=Database 1,cn=Databases,cn=Monitor
entryDN: cn=Database 1,cn=Databases,cn=Monitor
keeps alternating between the values 3896287 and 3896288. It looks like memory
is being re-used too quickly, causing locks that take a long time and leave the
system in a non-functional state.
I made several GDB traces for different conditions. Please see ftp attachment
file for details.
PS-> I could not put the file on the openldap ftp; it says the device is full.
Please let me know how I can send this file.
>> masarati(a)aero.polimi.it wrote:
>> I'd appreciate it very much if it would behave in exactly the same way as
>> all the other string-valued options.
> On a somewhat related issue, I note that LDAP_OPT_X_SASL_MECHLIST returns
> a pointer to an array of chars that apparently cannot be mucked with.
> Assuming my understanding is correct, I wonder if this behavior is
> desirable or not, given the fact that if another mech is added, e.g. by
> adding a dynamic module, I expect this list to change.
These are SASL mechs provided by the plugin modules, right?
From an operational standpoint: if a SASL plugin module for a mech was added, I
think it's acceptable that software which queries this option has to be
restarted before this SASL mech becomes known to it. One probably has to add
additional configuration for this SASL mech anyway.
Now the question is what happens if a SASL plugin module is removed and the
software tries to use the removed SASL mech. Clearly, removing plugin modules
on a running system is asking for trouble anyway...
Having said this, I would not care too much about this list changing...
What does this look like with top and/or in dmesg over the run time? Is it a
simple out-of-memory? Definitely a bit of a gross method, but what does
ls -lh core show for size?
(If so, is it warranted given your load, or is there a leak, etc. etc. ... and
of course make sure you're up to date on the surrounding packages;
OpenLDAP isn't the only thing that can leak.)
> clem.oudot(a)gmail.com wrote:
>> I have a rootdn. An extract of my slapd.conf is :
>> database bdb
>> suffix dc=3Dexample,dc=3Dcom
>> rootdn cn=3Dmanager,dc=3Dexample,dc=3Dcom
>> rootpw secret
>> directory /var/lib/ldap
>> overlay ppolicy
>> overlay unique
>> unique_uri ldap:///ou=3Dusers,dc=3Dexample,dc=3Dcom?uid?sub?(objectClass=3D=
> Could you please repost this without the broken quoted printables? Especially
> the filter part of the value for 'unique_uri'.
I suspect the ITS software messed this up, because the direct Cc:-ed messages
to me did not contain the messed-up quoted printables. So here's what Clément
originally sent as an excerpt of his config:
> I have a rootdn. An extract of my slapd.conf is :
> database bdb
> suffix dc=3Dexample,dc=3Dcom
> rootdn cn=3Dmanager,dc=3Dexample,dc=3Dcom
> rootpw secret
> directory /var/lib/ldap
> overlay ppolicy
> overlay unique
> unique_uri ldap:///ou=3Dusers,dc=3Dexample,dc=3Dcom?uid?sub?(objectClass=3D=
Could you please repost this without the broken quoted printables? Especially
the filter part of the value for 'unique_uri'.
Full_Name: Aaron Richton
OS: Solaris 9
Submission from: (NULL) (184.108.40.206)
t@5 (l@5) terminated by signal SEGV (no mapping at the fault address)
Current function is connection_abandon
729 op.orn_msgid = o->o_msgid;
current thread: t@5
=> connection_abandon(c = 0x106df088), line 729 in "connection.c"
 connection_closing(c = 0x106df088, why = 0x2859e0 "connection lost"), line
777 in "connection.c"
 connection_read(s = 11, cri = 0xfd3ffd64), line 1427 in "connection.c"
 connection_read_thread(ctx = 0xfd3ffe0c, argv = 0xb), line 1245 in
 ldap_int_thread_pool_wrapper(xpool = 0x10491f08), line 685 in "tpool.c"
(dbx) print o->o_hdr
o->o_hdr = 0xdeadbeef
Full backtrace in ITS link. testrun directory: