Full_Name: Rein Tollevik
Version: CVS head
OS: linux and solaris
URL: ftp://ftp.openldap.org/incoming/slapd_rq_lock.patch
Submission from: (NULL) (81.93.160.250)
I was seeing random failures of the test050-syncrepl-multimaster test. One of
the failures was that it went into a tight loop traversing a circular runqueue
it had managed to create in slapd_rq.task_list. This appears to have been caused
by missing mutex locks around accesses to slapd_rq, which the patch uploaded to
ftp://ftp.openldap.org/incoming/slapd_rq_lock.patch fixes.
Before I applied this patch the test failed after being run a few times; with it
applied, the test has now passed 100 times and is still counting.
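For reference, the locking convention such a fix follows looks roughly like this
(a minimal sketch, not the patch itself, assuming slapd's internal ldap_rq.h API,
the global slapd_rq runqueue, and a hypothetical resched_task() helper):

#include "slap.h"       /* slapd internals: slapd_rq, ldap_pvt_thread_*() */
#include "ldap_rq.h"    /* struct runqueue_s, struct re_s, ldap_pvt_runqueue_*() */

/* Reschedule a runqueue task.  Any traversal or modification of
 * slapd_rq.task_list must be done with rq_mutex held; without the lock a
 * concurrent insert/remove can corrupt the list (e.g. make it circular). */
static void
resched_task( struct re_s *rtask )
{
    ldap_pvt_thread_mutex_lock( &slapd_rq.rq_mutex );
    if ( ldap_pvt_runqueue_isrunning( &slapd_rq, rtask ) ) {
        ldap_pvt_runqueue_stoptask( &slapd_rq, rtask );
    }
    ldap_pvt_runqueue_resched( &slapd_rq, rtask, 0 );
    ldap_pvt_thread_mutex_unlock( &slapd_rq.rq_mutex );
}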
Rein Tollevik
Basefarm AS
Full_Name: Rein Tollevik
Version: CVS head
OS: linux and solaris
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (81.93.160.250)
Below is a minor enhancement to the auditlog overlay: don't fetch the time()
until it is known to be needed, which it isn't for all the search responses
that pass through the overlay.
Rein Tollevik
Basefarm AS
Index: OpenLDAP/servers/slapd/overlays/auditlog.c
diff -u OpenLDAP/servers/slapd/overlays/auditlog.c:1.1.1.8
OpenLDAP/servers/slapd/overlays/auditlog.c:1.2
--- OpenLDAP/servers/slapd/overlays/auditlog.c:1.1.1.8 Sat Mar 22 16:48:17 2008
+++ OpenLDAP/servers/slapd/overlays/auditlog.c Wed Mar 26 22:58:55 2008
@@ -74,7 +74,7 @@
Modifications *m;
struct berval *b, *who = NULL;
char *what, *suffix;
- long stamp = slap_get_time();
+ time_t stamp;
int i;
if ( rs->sr_err != LDAP_SUCCESS ) return SLAP_CB_CONTINUE;
@@ -125,6 +125,7 @@
return SLAP_CB_CONTINUE;
}
+ stamp = slap_get_time();
fprintf(f, "# %s %ld %s%s%s\n",
what, stamp, suffix, who ? " " : "", who ? who->bv_val : "");
ando(a)sys-net.it wrote:
[about REP_ENTRY_MUSTRELEASE]
> it is not
> clear what happens when a callback chain is interrupted by
> slap_null_cb() or similar, without getting to slap_send_search_entry().
> This seems to indicate that callbacks should always provide a last
> resort means to release the resources they set; if read-only, by keeping
> track of what they sent; if modifiable, by freeing the contents of
> rs->sr_* if not NULL, setting it to NULL to prevent further cleanup.
That sounds cumbersome; I hope slapd could take care of that somehow.
But I don't see how the be_release() code can work now. It sounds like
be->be_release() functions must check (how?) that the entry was created
by 'be', and otherwise pass it on to the next overlay/backend, or
failing that to entry_free(). Might involve mucking with op->o_bd and
sr_entry->e_private, I suppose. Except maybe I'm missing some existing
magic since slapd doesn't regularly crash...
be->be_release() does receive entries that were not created by 'be' or
at least not with be->be_fetch(); see the openldap-devel thread 'slapd API'
from March 2008.
> Similarly, the existence of REP_ENTRY_MUSTBEFREED is not totally clear:
> in principle as soon as REP_ENTRY_MODIFYABLE is set, it should imply
> REP_ENTRY_MUSTBEFREED; the only difference in the semantics of the two
> is that REP_ENTRY_MODIFYABLE without REP_ENTRY_MUSTBEFREED implies that
> the callback that set the former will take care of freeing the entry;
> however, other callbacks may further modify it, so freeing temporary
> data should probably be left to the final handler.
That's not my impression. MODIFIABLE would mean that modules other than
the creator can modify the entry - but the creator might still be the
one who will free it. MUSTBEFREED means that the entry must be
entry_free()ed - the creator will not do it (or not do it unless that
flag is set).
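To make that contract concrete, here is a minimal sketch (hypothetical callback
name, and it deliberately ignores the op->o_bd/be_release() complications
discussed above) of how a response callback that wants its own modifiable copy
might honour the flags before replacing rs->sr_entry:

static int
my_response( Operation *op, SlapReply *rs )
{
    if ( rs->sr_type == REP_SEARCH && rs->sr_entry != NULL ) {
        /* take a private, modifiable copy of the entry */
        Entry *copy = entry_dup( rs->sr_entry );

        /* dispose of the original according to its flags */
        if ( rs->sr_flags & REP_ENTRY_MUSTRELEASE ) {
            be_entry_release_rw( op, rs->sr_entry, 0 );
        } else if ( rs->sr_flags & REP_ENTRY_MUSTBEFREED ) {
            entry_free( rs->sr_entry );
        }

        /* the copy is ours: downstream code may modify it and must
         * eventually entry_free() it */
        rs->sr_entry = copy;
        rs->sr_flags &= ~( REP_ENTRY_MODIFIABLE | REP_ENTRY_MUSTBEFREED
            | REP_ENTRY_MUSTRELEASE );
        rs->sr_flags |= REP_ENTRY_MODIFIABLE | REP_ENTRY_MUSTBEFREED;
    }
    return SLAP_CB_CONTINUE;
}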
So, if I'm getting this right...
A backend must expect an entry to change or be freed if it sends the
entry with REP_ENTRY_<MUSTRELEASE, MUSTBEFREED or MODIFIABLE>, or if
it passes through a slap_callback.sc_response.
back-ldif does not: it uses e_name/e_nname _after_ sending the entry with
REP_ENTRY_MODIFIABLE.
Nor does the retcode overlay: it calls retcode_entry_response(,,,rs->sr_entry),
which keeps pointers into sr_entry without checking whether those flags are
set. If I'm reading that code correctly; I haven't tested.
Other apparent problems (also not tested, I've just browsed the code):
Overlays that obey and reset MUSTBEFREED or MUSTRELEASE do not
necessarily clear or set MODIFIABLE when setting a new entry.
translucent does not, even though it is a careful one which
obeys both MUSTBEFREED and MUSTRELEASE.
I'm not sure when the code must clear the entry-related flags.
Some code which sets new entries seems to assume they are zero,
but some code which sets sr_entry=NULL does not clear the flags.
There are sr_flags bits which are not about the entry, in particular
REP_MATCHED_MUSTBEFREED and REP_REF_MUSTBEFREED. It looks like these
flags can get lost because some code sets sr_flags = <something>
rather than sr_flags |= <something> or sr_flags &= ~<something>.
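A short illustration of that pitfall (a sketch assuming only the REP_* flag
macros from slap.h):

/* Plain assignment discards any non-entry bits that happened to be set,
 * e.g. REP_MATCHED_MUSTBEFREED or REP_REF_MUSTBEFREED: */
rs->sr_flags = REP_ENTRY_MODIFIABLE | REP_ENTRY_MUSTBEFREED;

/* Touching only the entry-related bits preserves them: */
rs->sr_flags &= ~( REP_ENTRY_MODIFIABLE | REP_ENTRY_MUSTBEFREED
    | REP_ENTRY_MUSTRELEASE );
rs->sr_flags |= REP_ENTRY_MODIFIABLE | REP_ENTRY_MUSTBEFREED;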
--
Hallvard
--On Thursday, March 27, 2008 10:44 PM +0000 bdb(a)oss.oracle.com wrote:
> I know you're not using BDB 4.6 with OpenLDAP 2.3, so this patch isn't
> strictly relevant to this ITS. It just raises the possibility that there
> is a BDB bug involved.
Looking at the code for 4.2.52, the mp_fopen.c code is essentially
identical to what this patch fixes, so I'd guess it could affect 2.3
releases using bdb 4.2->4.5. And I'm guessing there'll be no patch from
Oracle for those much older releases.
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
I know you're not using BDB 4.6 with OpenLDAP 2.3, so this patch isn't
strictly relevant to this ITS. It just raises the possibility that there is a
BDB bug involved.
-------- Original Message --------
Subject: [Berkeley DB Announce] Berkeley DB 4.6.21.1 - Patch Announcement
Date: Thu, 27 Mar 2008 18:03:40 -0400
From: bdb(a)oss.oracle.com
Reply-To: bdb(a)oss.oracle.com
Organization: Oracle Corporation
To: <bdb(a)oss.oracle.com>
A patch for Berkeley DB release 4.6.21 that fixes a race condition between
checkpoint and DB->close() which can result in the checkpoint thread
self-deadlocking (SR [#15692]) is now available. We recommend all users of
Berkeley DB apply this patch so as to avoid this issue in production.
A web page with all patches (currently only one) for Berkeley DB 4.6.21:
http://www.oracle.com/technology/products/berkeley-db/xml/update/4.6.21/patch.4.6.21.html
A text file with the patch:
http://www.oracle.com/technology/products/berkeley-db/xml/update/4.6.21/patch.4.6.21.1
regards, the Berkeley DB Team,
-greg
_____________________________________________________________________
Gregory Burd greg.burd(a)oracle.com
Product Manager, Berkeley DB/JE/XML Oracle Corporation
_______________________________________________
BDB mailing list
BDB(a)oss.oracle.com
http://oss.oracle.com/mailman/listinfo/bdb
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
ali.pouya(a)free.fr wrote:
> Full_Name: Ali Pouya
> Version: 2.4.8
> OS: Linux 2.6
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (145.242.11.3)
>
>
> Hi,
> I'm testing the back-meta of OpenLdap 2.4.8 in order to upgrade to this
> release.
> My client opens a persistent LDAP connection to slapd, binds with an ordinary
> account (from the local back-bdb) and begins search operations.
>
> I notice that if a search operation results in an LDAP error (for example
> error 32: No Such Object), then back-meta opens a new connection to the target
> directory for the next operation, leaving the "bad" connection open.
>
> The conn-ttl and idle-timeout parameters do not close the "bad" connections.
> These connections remain there until the client ends its connection.
>
> This saturates the target servers with unused idle connections.
>
> Is this a bug or normal behaviour?
>
> I should point out that the behaviour is not the same if the client binds
> with rootdn.
1) Can you post your slapd.conf (sanitized)?
2) What exactly is causing the error? What error is actually being
returned from all targets? For back-meta, no such object has some
special semantics.
It would be best if you could provide a setup that allows the critical
situation to be reproduced with a very simple test, involving as few
targets as possible, along with the LDIF required to populate each
target and the operation you used.
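For reference, a minimal back-meta configuration exercising the directives in
question might look like the following (hypothetical suffix, URIs and timeout
values; not the submitter's actual setup):

database        meta
suffix          "dc=example,dc=com"
# drop cached connections that stay idle longer than 60 seconds
idle-timeout    60
# drop cached connections 5 minutes after they were created
conn-ttl        300
# one target per uri directive
uri             "ldap://target1.example.com/dc=example,dc=com"
uri             "ldap://target2.example.com/dc=example,dc=com"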
p.
Ing. Pierangelo Masarati
OpenLDAP Core Team
SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
---------------------------------------
Office: +39 02 23998309
Mobile: +39 333 4963172
Email: pierangelo.masarati(a)sys-net.it
---------------------------------------
Full_Name: Ali Pouya
Version: 2.4.8
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (145.242.11.3)
Hi,
I'm testing the back-meta of OpenLdap 2.4.8 in order to upgrade to this
release.
My client opens a persistent LDAP connection to slapd, binds with an ordinary
account (from the local back-bdb) and begins search operations.
I notice that if a search operation results in an LDAP error (for example error
32: No Such Object), then back-meta opens a new connection to the target
directory for the next operation, leaving the "bad" connection open.
The conn-ttl and idle-timeout parameters do not close the "bad" connections.
These connections remain there until the client ends its connection.
This saturates the target servers with unused idle connections.
Is this a bug or normal behaviour?
I should point out that the behaviour is not the same if the client binds with
rootdn.
Thanks for your help
Best regards
Ali
Rein Tollevik wrote:
> On Tue, 25 Mar 2008, hyc(a)symas.com wrote:
> It now looks to me as if the entire rs->sr_entry was released and reused,
> and that the bug probably is in back-bdb. It just always happened when
> syncprov was running as this is a master server with mostly writes
> except from the reads syncprov does. Which also means that the title on
> the bug report is probably misleading :-(
>
> The reason for my shift of focus is some highly suspicious variables
> found in the bdb_search() frame.
Look at the value of idflag in bdb_search. (see the comment around 672)
Could be a locking bug with ID_NOCACHE in cache.c...
> First, ei->bei_e seems to be NULL every time this happens, not the same as
> the "e" variable as I would expect from looking at the code. Second,
> e->e_id is either 0 or ei->bei_id+1, the content of *e is the (more or
> less completely initialized) entry following the *ei entry in the
> database, while I would expect that ei->bei_id==e->e_id and that
> ei->bei_e==e. I don't think the consecutive ID numbers or entries have
> anything to do with the bug though, they are probably a result of the
> searches reading the entire database.
>
> It looks to me as if the EntryInfo referred to by ei was released somehow
> when slap_send_search_entry() was running, and that the ei->bei_e passed
> as rs->sr_entry was first released and then reused for the next entry.
> Alternatively, the structures weren't properly initialized and
> thereby appeared free to the second thread.
>
> I have found a core file that adds to my suspicions, this time with two
> active threads. Their stack frames are shown at the end. In this core
> file the id variable in the bdb_search() frame is 9318, the second
> thread is loading id==9319. The value of the "e" variable in
> bdb_search() for the first thread equals the "x" variable of the
> entry_decode() frame in the second thread. Since entry_decode() has
> just allocated its "x" value it shouldn't have found the same value as
> the other thread has unless someone released its "e" entry while it was
> still using it. Why this happened I haven't figured out yet.
>
> The cachesize directive has its default value, which is way below the
> optimal value. The result is that the cache is always full. I don't
> know if that can have any influence on the bug. I tried to lower its
> value to see if that would help reproduce the problem, but no luck so
> far.
>
>>> I'm currently trying to gather more information related to this bug, any
>>> pointers as to what I should look for is appreciated. I'm posting this bug
>>> report now in the hope that the stack frame should enlighten someone with better
>>> knowledge of the code than what I have.
>> Check for stack overruns, compile without optimization and make sure it's not
>> a compiler optimization bug, etc.
>
> I'll try to run it with a version compiled with only -g and see if the
> same thing still happens then.
>
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
On Tue, 25 Mar 2008, hyc(a)symas.com wrote:
> rein(a)basefarm.no wrote:
>> We have been hit by what looks like a race condition bug in syncprov. We got
>> some core dumps all showing stack frames like the one at the end. As such nasty
>> bugs tend to do, it has behaved OK after I restarted slapd with more debug
>> output :-( (trace + stats + stats2 + sync).
>>
>>
>> Note the a=0xBAD argument to attr_find(), which I expect is the result of some
>> other thread freeing the attribute list it was called with while it was
>> processing it. The rs->sr_entry->e_attrs argument passed to attr_find() as the
>> original "a" argument by findpres_cb() looks like a perfectly valid structure,
>> as are all the attributes found by following the a_next pointer. The list is
>> terminated by an attribute with a NULL a_next value, none of the a_next values
>> are 0xBAD.
>
> I don't believe that's the cause. Notice that arg0 in stack frame #9 is also
> 0xbad, even though it is shown correctly in frames 8 and 10. Something else is
> going on.
I don't know how I managed to overlook the strange 0xbad argument in
frame #9, and I really have no explanation as to why it shows that
value. But if I am on the right track in my assumptions below, I
currently write it off as an unexpected result of optimization.
It now looks to me as if the entire rs->sr_entry was released and reused,
and that the bug probably is in back-bdb. It just always happened when
syncprov was running as this is a master server with mostly writes
except from the reads syncprov does. Which also means that the title on
the bug report is probably misleading :-(
The reason for my shift of focus is some highly suspicious variables
found in the bdb_search() frame.
First, ei->bei_e seems to be NULL every time this happens, not the same as
the "e" variable as I would expect from looking at the code. Second,
e->e_id is either 0 or ei->bei_id+1, the content of *e is the (more or
less completely initialized) entry following the *ei entry in the
database, while I would expect that ei->bei_id==e->e_id and that
ei->bei_e==e. I don't think the consecutive ID numbers or entries have
anything to do with the bug though, they are probably a result of the
searches reading the entire database.
It looks to me as if the EntryInfo referred to by ei was released somehow
when slap_send_search_entry() was running, and that the ei->bei_e passed
as rs->sr_entry was first released and then reused for the next entry.
Alternatively, the structures weren't properly initialized and
thereby appeared free to the second thread.
I have found a core file that adds to my suspicions, this time with two
active threads. Their stack frames are shown at the end. In this core
file the id variable in the bdb_search() frame is 9318, the second
thread is loading id==9319. The value of the "e" variable in
bdb_search() for the first thread equals the "x" variable of the
entry_decode() frame in the second thread. Since entry_decode() has
just allocated its "x" value it shouldn't have found the same value as
the other thread has unless someone released its "e" entry while it was
still using it. Why this happened I haven't figured out yet.
The cachesize directive has its default value, which is way below the
optimal value. The result is that the cache is always full. I don't
know if that can have any influence on the bug. I tried to lower its
value to see if that would help reproduce the problem, but no luck so
far.
>> I'm currently trying to gather more information related to this bug, any
>> pointers as to what I should look for is appreciated. I'm posting this bug
>> report now in the hope that the stack frame should enlighten someone with better
>> knowledge of the code than what I have.
>
> Check for stack overruns, compile without optimization and make sure it's not
> a compiler optimization bug, etc.
I'll try to run it with a version compiled with only -g and see if the
same thing still happens then.
Rein
Active thread when the segmentation fault occurred:
#0 0x0807d03a in attr_find (a=0xbad, desc=0x81e8680) at attr.c:665
#1 0xb7a656f6 in findpres_cb (op=0xac05cba4, rs=0xac05cb68) at syncprov.c:546
#2 0x0808416d in slap_response_play (op=0xac05cba4, rs=0xac05cb68) at result.c:307
#3 0x0808555b in slap_send_search_entry (op=0xac05cba4, rs=0xac05cb68) at result.c:770
#4 0x080f2cdc in bdb_search (op=0xac05cba4, rs=0xac05cb68) at search.c:870
#5 0x080db72b in overlay_op_walk (op=0xac05cba4, rs=0xac05cb68, which=op_search, oi=0x8274218, on=0x8274318) at backover.c:653
#6 0x080dbcaf in over_op_func (op=0xac05cba4, rs=0xac05cb68, which=op_search) at backover.c:705
#7 0x080dbdef in over_op_search (op=0xac05cba4, rs=0xac05cb68) at backover.c:727
#8 0x080d9570 in glue_sub_search (op=0xac05cba4, rs=0xac05cb68, b0=0xac05cba4, on=0xac05cba4) at backglue.c:340
#9 0x080da131 in glue_op_search (op=0xbad, rs=0xac05cb68) at backglue.c:459
#10 0x080db6d5 in overlay_op_walk (op=0xac05cba4, rs=0xac05cb68, which=op_search, oi=0x8271860, on=0x8271a60) at backover.c:643
#11 0x080dbcaf in over_op_func (op=0xac05cba4, rs=0xac05cb68, which=op_search) at backover.c:705
#12 0x080dbdef in over_op_search (op=0xac05cba4, rs=0xac05cb68) at backover.c:727
#13 0xb7a65ff4 in syncprov_findcsn (op=0x85085b8, mode=FIND_PRESENT) at syncprov.c:700
#14 0xb7a670a0 in syncprov_op_search (op=0x85085b8, rs=0xac05e1c0) at syncprov.c:2277
#15 0x080db6d5 in overlay_op_walk (op=0x85085b8, rs=0xac05e1c0, which=op_search, oi=0x8271860, on=0x8271b60) at backover.c:643
#16 0x080dbcaf in over_op_func (op=0x85085b8, rs=0xac05e1c0, which=op_search) at backover.c:705
#17 0x080dbdef in over_op_search (op=0x85085b8, rs=0xac05e1c0) at backover.c:727
#18 0x08076554 in fe_op_search (op=0x85085b8, rs=0xac05e1c0) at search.c:368
#19 0x080770e4 in do_search (op=0x85085b8, rs=0xac05e1c0) at search.c:217
#20 0x08073e28 in connection_operation (ctx=0xac05e2b8, arg_v=0x85085b8) at connection.c:1084
#21 0x08074f14 in connection_read_thread (ctx=0xac05e2b8, argv=0x3d) at connection.c:1211
#22 0xb7fb5546 in ldap_int_thread_pool_wrapper (xpool=0x81ee240) at tpool.c:663
#23 0xb7c80371 in start_thread () from /lib/tls/libpthread.so.0
#24 0xb7c17ffe in clone () from /lib/tls/libc.so.6
Thread loading the next entry:
#0 0x08123ef6 in avl_find (root=0x81e7940, data=0xae7a62e8, fcmp=0x80b8871 <attr_index_name_cmp>) at avl.c:541
#1 0x080b7852 in at_bvfind (name=0xae7a62e8) at at.c:130
#2 0x080b5fd8 in slap_bv2ad (bv=0xae7a6354, ad=0xae7a6360, text=0xae7a6364) at ad.c:189
#3 0x0807eef5 in entry_decode (eh=0xae7a63bc, e=0xae7a64a8) at entry.c:859
#4 0x0810d849 in bdb_id2entry (be=0xae8666c0, tid=0x0, locker=0xb742da7c, id=9319, e=0xae7a64a8) at id2entry.c:169
#5 0x08104ffb in bdb_cache_find_id (op=0xae866ba4, tid=0x0, id=9319, eip=0xae7a6614, flag=2, locker=0xb742da7c, lock=0xae7a65f4) at cache.c:834
#6 0x080f2682 in bdb_search (op=0xae866ba4, rs=0xae866b68) at search.c:684
#7 0x080db72b in overlay_op_walk (op=0xae866ba4, rs=0xae866b68, which=op_search, oi=0x8274218, on=0x8274318) at backover.c:653
#8 0x080dbcaf in over_op_func (op=0xae866ba4, rs=0xae866b68, which=op_search) at backover.c:705
#9 0x080dbdef in over_op_search (op=0xae866ba4, rs=0xae866b68) at backover.c:727
#10 0x080d9570 in glue_sub_search (op=0xae866ba4, rs=0xae866b68, b0=0x81e7930, on=0xae866ba4) at backglue.c:340
#11 0x080da131 in glue_op_search (op=0xfffffffa, rs=0xae866b68) at backglue.c:459
#12 0x080db6d5 in overlay_op_walk (op=0xae866ba4, rs=0xae866b68, which=op_search, oi=0x8271860, on=0x8271a60) at backover.c:643
#13 0x080dbcaf in over_op_func (op=0xae866ba4, rs=0xae866b68, which=op_search) at backover.c:705
#14 0x080dbdef in over_op_search (op=0xae866ba4, rs=0xae866b68) at backover.c:727
#15 0xb7a65ff4 in syncprov_findcsn (op=0x8508e60, mode=FIND_PRESENT) at syncprov.c:700
#16 0xb7a670a0 in syncprov_op_search (op=0x8508e60, rs=0xae8681c0) at syncprov.c:2277
#17 0x080db6d5 in overlay_op_walk (op=0x8508e60, rs=0xae8681c0, which=op_search, oi=0x8271860, on=0x8271b60) at backover.c:643
#18 0x080dbcaf in over_op_func (op=0x8508e60, rs=0xae8681c0, which=op_search) at backover.c:705
#19 0x080dbdef in over_op_search (op=0x8508e60, rs=0xae8681c0) at backover.c:727
#20 0x08076554 in fe_op_search (op=0x8508e60, rs=0xae8681c0) at search.c:368
#21 0x080770e4 in do_search (op=0x8508e60, rs=0xae8681c0) at search.c:217
#22 0x08073e28 in connection_operation (ctx=0xae8682b8, arg_v=0x8508e60) at connection.c:1084
#23 0x08074f14 in connection_read_thread (ctx=0xae8682b8, argv=0x54) at connection.c:1211
#24 0xb7fb5546 in ldap_int_thread_pool_wrapper (xpool=0x81ee240) at tpool.c:663
#25 0xb7c80371 in start_thread () from /lib/tls/libpthread.so.0
#26 0xb7c17ffe in clone () from /lib/tls/libc.so.6