Re: (ITS#5444) slapd seg. fault using mulitmirrored mode
by ando@sys-net.it
hyc(a)symas.com wrote:
> ando(a)sys-net.it wrote:
>> ando(a)sys-net.it wrote:
>>> mohel(a)web.de wrote:
>>>
>>>> The above log was generated using the test050 script just adding a line which
>>>> enters information also to a second server. Perhaps this test would also make
>>>> sense for further releases.
>>> Can you please post the modification to the script?
>> Never mind, I could reproduce it. It seems to be a duplicate of
>> ITS#5437. What actually happens is that if modifications are applied to
>> all three servers, those applied to consumers do not persist, nor get
>> replicated; on the contrary, the consumers crash later on as described
>> in ITS#5437.
>
> #5437? What you're talking about doesn't seem to match any text in ITS#5437.
Well, sorry for not providing more details, I was going to investigate
it a little bit more but I had to leave. The crash to me occurs in
syncprov_done_ctrl(), as indicated in ITS#5437. What happens is that
syncprov_done_ctrl() with an invalid cookie (unset; clearing it in
syncprov_search_response(), as suggested in ITS#5437 leads to the
behavior I described (writes to consumers are simply ignored, don't
persist in the consumers and are not replicated). To reproduce, add
something to a consumer close to the end of test050, before killing the
servers, and wait a little bit to let things proceed to the crash.
p.
Ing. Pierangelo Masarati
OpenLDAP Core Team
SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
---------------------------------------
Office: +39 02 23998309
Mobile: +39 333 4963172
Email: pierangelo.masarati(a)sys-net.it
---------------------------------------
15 years, 5 months
Re: (ITS#5444) slapd seg. fault using mulitmirrored mode
by hyc@symas.com
ando(a)sys-net.it wrote:
> ando(a)sys-net.it wrote:
>> mohel(a)web.de wrote:
>>
>>> The above log was generated using the test050 script just adding a line which
>>> enters information also to a second server. Perhaps this test would also make
>>> sense for further releases.
>> Can you please post the modification to the script?
>
> Never mind, I could reproduce it. It seems to be a duplicate of
> ITS#5437. What actually happens is that if modifications are applied to
> all three servers, those applied to consumers do not persist, nor get
> replicated; on the contrary, the consumers crash later on as described
> in ITS#5437.
#5437? What you're talking about doesn't seem to match any text in ITS#5437.
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
15 years, 5 months
Re: (ITS#5442) slapd_rq not locked before use bugfix
by hyc@symas.com
Rein Tollevik wrote:
> On Sun, 30 Mar 2008, hyc(a)symas.com wrote:
>
>> rein(a)tollevik.no wrote:
>>> On Sat, 29 Mar 2008, ando(a)sys-net.it wrote:
>>>> rein(a)basefarm.no wrote:
>
>>>>> I was seeing random failures of the test050-syncrepl-multimaster test. One of
>>>>> the failures was that it went into a tight loop traversing a circular runqueue
>>>>> it had managed to create in slapd_rq.task_list. It seems as this was caused by
>>>>> missing mutex locks around accesses to slapd_rq, which the patch uploaded to
>>>>> ftp://ftp.openldap.org/incoming/slapd_rq_lock.patch fixes.
>>>>>
>>>>> Before I applied this patch the test failed after being run a few times, with it
>>>>> it has now passed 100 times and is still counting.
>>>> locks in back-bdb/config.c should be pointless, as modifications to the
>>>> configuration should only occur while all threads are paused. The rest
>>>> makes sort of sense, but I'd leave it to Howard.
>> Ignoring the ITS#5403 changes, I don't see anything here that isn't
>> config-related, therefore it's all running single-threaded.
>
> Now that the configuration can be changed dynamically (as this test does)
> I find it a bit odd that the config stuff should always be running
> single-threaded. But there is obviously much I don't know about the
> internals of slapd.
This was discussed at the 2003 OpenLDAP Developers' Day.
http://www.openldap.org/conf/odd-wien-2003/proceedings.html
Page 7 of my slides touched on it.
It's also mentioned again here
http://www.openldap.org/lists/openldap-devel/200505/msg00062.html
The main point is that the original config stuff assumed that it was only
executing during slapd startup, and single-threaded. Putting locks around all
of the potential configurable value accesses would have been too much work, so
the decision was made to force slapd to be single-threaded whenever writing to
the config.
I'm guessing that what's happening here is just a broken assumption -
suspending the thread pool doesn't in fact freeze everything in slapd. In
particular, the slapd listener thread may still wake up for a select() timeout
to handle the slapd runqueue. Even though no tasks submitted as a result of
this will execute (because they'll just get queued into the suspended thread
pool) the runqueue itself is still being manipulated.
Given that explanation, your patches make sense.
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
15 years, 5 months
Re: (ITS#5439) back-bdb race condition
by quanah@zimbra.com
--On Monday, March 31, 2008 6:27 PM +0000 rein(a)basefarm.no wrote:
> We did see segmentation faults both Friday and Monday (not very much
> going on during the weekend..), but so far those appear to be related to
> ITS#5445. A slapd and db_checkpoint (run out of cron) deadlock which I
> hope is due to the BerkeleyDB bug mentioned in ITS#5391 has also
> occurred, so I'll upgrade from 4.6.18 to 4.6.21.1 to see if that problem
> goes away.
Why do you run db_checkpoint out of cron? Slapd will already take care of
checkpointing the database for you using the "checkpoint" directive...
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
15 years, 5 months
Re: (ITS#5439) back-bdb race condition
by rein@basefarm.no
On Tue, 25 Mar 2008, Howard Chu wrote:
> Rein Tollevik wrote:
>> On Tue, 25 Mar 2008, hyc(a)symas.com wrote:
>> It now looks to me as if the entire rs->sr_entry was released an reused,
>> and that the bug probably is in back-bdb. It just always happened when
>> syncprov was running as this is a master server with mostly writes
>> except from the reads syncprov does. Which also means that the title on
>> the bug report is probably misleading :-(
>>
>> The reason for my shift of focus is some highly suspicious variables
>> found in the bdb_search() frame.
>
> Look at the value of idflag in bdb_search. (see the comment around 672)
>
> Could be a locking bug with ID_NOCACHE in cache.c...
Yes, it looks very much as this was the problem. We have been running
with the patch at the end since thursday afternoon, and so far it seem
to have worked :-) Although it is a bit too early to conclude,
especially as we have seen other problems as well.
We did see segmentation faults both friday and monday (not very much
going on during the weekend..), but so far those appear to be related to
ITS#5445. A slapd and db_checkpoint (run out of cron) deadlock which I
hope is due to the BerkeleyDB bug mentioned in ITS#5391 has also
occurred, so I'll upgrade from 4.6.18 to 4.6.21.1 to see if that problem
goes away.
Now, back to this bug. There is a potential bug in bdb_cache_find_id()
as it inspects and modifies the EntryInfo without it being locked.
Although I don't think this was the cause here, as bdb_search() always
passes a NULL eip which mean that bdb_cache_find_ndn() should have
acquired a lock. The patch always locks the entryinfo though, as I
believe that to be necessary for the rest of the patch to be correct.
The real problem was in bdb_cache_return_entry_rw() which released
ei->bei_e without making sure it was the only thread using it. I.e it
should have had a write lock on the entry_db (as bdb_cache_lru_purge()
makes sure to have), but instead it freed it after releasing the read
lock.
Instead of write locking the entry before releasing it my alternative
solution is to only let the thread that loads the entry set the
CACHE_ENTRY_NOT_CACHED flag, and only if no other threads found the
entryinfo while it loaded the entry. All other threads clears the flag
when they find they don't need to load the entry.
I also use trylock to lock the entryinfo before setting or inspecting
the cache flag, for two reasons. First, honoring this flag should not
be important enough to wait for a lock (after all, it should be cleared
anyhow if anyone else is interested in the same entry). Second (and
most important), to avoid any deadlock situations, which my first
attempt to fix this bug managed to create when I moved the release of
the entryinfo lock to after the entry_db lock had been acquired in
bdb_cache_find_id().
Rein
Index: OpenLDAP/servers/slapd/back-bdb/cache.c
diff -u OpenLDAP/servers/slapd/back-bdb/cache.c:1.1.1.15 OpenLDAP/servers/slapd/back-bdb/cache.c:1.2
--- OpenLDAP/servers/slapd/back-bdb/cache.c:1.1.1.15 Sat Mar 22 16:47:56 2008
+++ OpenLDAP/servers/slapd/back-bdb/cache.c Mon Mar 31 17:52:17 2008
@@ -252,16 +252,27 @@
int free = 0;
ei = e->e_private;
- bdb_cache_entry_db_unlock( bdb, lock );
- if ( ei ) {
- bdb_cache_entryinfo_lock( ei );
+ if ( ei &&
+ ( ei->bei_state & CACHE_ENTRY_NOT_CACHED ) &&
+ ( bdb_cache_entryinfo_trylock( ei ) == 0 ) ) {
if ( ei->bei_state & CACHE_ENTRY_NOT_CACHED ) {
+ /* Releasing the entry can only be done when
+ * we know that nobody else is using it, i.e we
+ * should have an entry_db writelock. But the
+ * flag is only set by the thread that loads the
+ * entry, and only if no other threads has found
+ * it while it was working. All other threads
+ * clear the flag, which mean that we should be
+ * the only thread using the entry if the flag
+ * is set here.
+ */
ei->bei_e = NULL;
ei->bei_state ^= CACHE_ENTRY_NOT_CACHED;
free = 1;
}
bdb_cache_entryinfo_unlock( ei );
}
+ bdb_cache_entry_db_unlock( bdb, lock );
if ( free ) {
e->e_private = NULL;
bdb_entry_return( e );
@@ -854,6 +865,11 @@
/* Ok, we found the info, do we have the entry? */
if ( rc == 0 ) {
+ if ( !( flag & ID_LOCKED )) {
+ bdb_cache_entryinfo_lock ( *eip );
+ flag |= ID_LOCKED;
+ }
+
if ( (*eip)->bei_state & CACHE_ENTRY_DELETED ) {
rc = DB_NOTFOUND;
} else {
@@ -873,13 +889,13 @@
(*eip)->bei_state |= CACHE_ENTRY_LOADING;
}
- /* If the entry was loaded before but uncached, and we need
- * it again, clear the uncached state
- */
- if ( (*eip)->bei_state & CACHE_ENTRY_NOT_CACHED ) {
- (*eip)->bei_state ^= CACHE_ENTRY_NOT_CACHED;
- if ( flag & ID_NOCACHE )
- flag ^= ID_NOCACHE;
+ if ( !load ) {
+ /* Clear the uncached state if we are not
+ * loading it, i.e it is already cached or
+ * another thread is currently loading it.
+ */
+ (*eip)->bei_state &= ~CACHE_ENTRY_NOT_CACHED;
+ flag &= ~ID_NOCACHE;
}
if ( flag & ID_LOCKED ) {
@@ -906,10 +922,14 @@
#endif
ep = NULL;
bdb_cache_lru_link( bdb, *eip );
- if ( flag & ID_NOCACHE ) {
- bdb_cache_entryinfo_lock( *eip );
- (*eip)->bei_state |= CACHE_ENTRY_NOT_CACHED;
- bdb_cache_entryinfo_unlock( *eip );
+ if ( ( flag & ID_NOCACHE ) &&
+ ( bdb_cache_entryinfo_trylock ( *eip ) == 0 ) ) {
+ /* Set the cached state only if no other thread
+ * found the info while we was loading the entry.
+ */
+ if ( (*eip)->bei_finders == 1 )
+ (*eip)->bei_state |= CACHE_ENTRY_NOT_CACHED;
+ bdb_cache_entryinfo_unlock ( *eip );
}
}
if ( rc == 0 ) {
15 years, 5 months
Re: (ITS#5444) slapd seg. fault using mulitmirrored mode
by mohelonline@googlemail.com
------=_Part_28055_26658090.1206987058866
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Thanks for the fast reply.
I just added a few lines after the 20 seconds wait for the master
replcaition:
----------------------------------
echo "Using ldapadd to populate consumer ..."
$LDAPADD -D "$MANAGERDN" -H $URI2 -w $PASSWD -f $LDIFORDERED2 \
>> $TESTOUT 2>&1
SLEEP=20
echo "Waiting $SLEEP seconds for syncrepl to receive changes..."
sleep $SLEEP
----------------------------------
These lines just add one additional ldapuser. After an additional 20seconds
pause I check if the changes (form server with URI2) are replicated.
But always the slapd prcoess dies after the ldapadd command.
I'll give your above suggestion a try and give feedback afterwards.
On Mon, Mar 31, 2008 at 6:04 PM, Pierangelo Masarati <ando(a)sys-net.it>
wrote:
> ando(a)sys-net.it wrote:
> > mohel(a)web.de wrote:
> >
> >> The above log was generated using the test050 script just adding a line
> which
> >> enters information also to a second server. Perhaps this test would
> also make
> >> sense for further releases.
> >
> > Can you please post the modification to the script?
>
> Never mind, I could reproduce it. It seems to be a duplicate of
> ITS#5437. What actually happens is that if modifications are applied to
> all three servers, those applied to consumers do not persist, nor get
> replicated; on the contrary, the consumers crash later on as described
> in ITS#5437.
>
> p.
>
>
>
> Ing. Pierangelo Masarati
> OpenLDAP Core Team
>
> SysNet s.r.l.
> via Dossi, 8 - 27100 Pavia - ITALIA
> http://www.sys-net.it
> ---------------------------------------
> Office: +39 02 23998309
> Mobile: +39 333 4963172
> Email: pierangelo.masarati(a)sys-net.it
> ---------------------------------------
>
>
>
------=_Part_28055_26658090.1206987058866
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Thanks for the fast reply.<br><br>I just added a few lines after the 20 seconds wait for the master replcaition:<br>------------------------------<div id="1eob" class="ArwC7c ckChnd">----<br>echo "Using ldapadd to populate consumer ..."<br>
$LDAPADD -D "$MANAGERDN" -H $URI2 -w $PASSWD -f $LDIFORDERED2 \<br>
>> $TESTOUT 2>&1<br><br>SLEEP=20<br>echo "Waiting $SLEEP seconds for syncrepl to receive changes..."<br>sleep $SLEEP<br>
----------------------------------<br><br>These
lines just add one additional ldapuser. After an additional 20seconds
pause I check if the changes (form server with URI2) are replicated.<br><br>But always the slapd prcoess dies after the ldapadd command.<br>
<br>I'll give your above suggestion a try and give feedback afterwards.</div><br><br><div class="gmail_quote">On Mon, Mar 31, 2008 at 6:04 PM, Pierangelo Masarati <<a href="mailto:ando@sys-net.it">ando(a)sys-net.it</a>> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d"><a href="mailto:ando@sys-net.it">ando(a)sys-net.it</a> wrote:<br>
> <a href="mailto:mohel@web.de">mohel(a)web.de</a> wrote:<br>
><br>
>> The above log was generated using the test050 script just adding a line which<br>
>> enters information also to a second server. Perhaps this test would also make<br>
>> sense for further releases.<br>
><br>
> Can you please post the modification to the script?<br>
<br>
</div>Never mind, I could reproduce it. It seems to be a duplicate of<br>
ITS#5437. What actually happens is that if modifications are applied to<br>
all three servers, those applied to consumers do not persist, nor get<br>
replicated; on the contrary, the consumers crash later on as described<br>
in ITS#5437.<br>
<div><div></div><div class="Wj3C7c"><br>
p.<br>
<br>
<br>
<br>
Ing. Pierangelo Masarati<br>
OpenLDAP Core Team<br>
<br>
SysNet s.r.l.<br>
via Dossi, 8 - 27100 Pavia - ITALIA<br>
<a href="http://www.sys-net.it" target="_blank">http://www.sys-net.it</a><br>
---------------------------------------<br>
Office: +39 02 23998309<br>
Mobile: +39 333 4963172<br>
Email: <a href="mailto:pierangelo.masarati@sys-net.it">pierangelo.masarati(a)sys-net.it</a><br>
---------------------------------------<br>
<br>
<br>
</div></div></blockquote></div><br>
------=_Part_28055_26658090.1206987058866--
15 years, 5 months
Re: (ITS#5442) slapd_rq not locked before use bugfix
by rein@basefarm.no
On Sun, 30 Mar 2008, hyc(a)symas.com wrote:
> rein(a)tollevik.no wrote:
>> On Sat, 29 Mar 2008, ando(a)sys-net.it wrote:
>>> rein(a)basefarm.no wrote:
>>>> I was seeing random failures of the test050-syncrepl-multimaster test. One of
>>>> the failures was that it went into a tight loop traversing a circular runqueue
>>>> it had managed to create in slapd_rq.task_list. It seems as this was caused by
>>>> missing mutex locks around accesses to slapd_rq, which the patch uploaded to
>>>> ftp://ftp.openldap.org/incoming/slapd_rq_lock.patch fixes.
>>>>
>>>> Before I applied this patch the test failed after being run a few times, with it
>>>> it has now passed 100 times and is still counting.
>>> locks in back-bdb/config.c should be pointless, as modifications to the
>>> configuration should only occur while all threads are paused. The rest
>>> makes sort of sense, but I'd leave it to Howard.
>
> Ignoring the ITS#5403 changes, I don't see anything here that isn't
> config-related, therefore it's all running single-threaded.
Now that the configuration can be changed dynamically (as this test does)
I find it a bit odd that the config stuff should always be running
single-threaded. But there is obviously much I don't know about the
internals of slapd.
> Of the "relevant" changes in syncrepl.c, I note that three out of the four
> chunks of the patch are in code that is only run when using cn=config to
> delete a syncrepl configuration, and test050 never performs that operation.
> The remaining chunk only takes affect when adding syncrepl config, and again,
> slapd is single-threaded for that.
I have been experimenting a bit more with the patch, and it appears to be
the locks added to add_syncrepl() that stopped the test from failing.
The test uses ordinary ldapadd to add the syncrepl configuration after
slapd has been started, which I would expect to indicate that it has
entered the multi-threaded state. But again, it can revert to
single-threaded when modifying the configuration for all I know.
> I've also run test050 thru hundreds of iterations without any issue, without
> these patches. If there's a problem in test050, I don't believe it's in this code.
I have also run it several hundred times on other systems, it is only on
one of the 4 systems I build slapd that it fails. That is, that test
failed on one of the other systems with the 2.4.8 release, also with an
infinite loop, but that dosn't say vary much in this context.
The system where the test fail is the only multi-cpu system I have run it
on, which I recon to be the reason why it only fails there. Without this
patch the test fails about 20% of the times it is run on that system, the
reason varies between slapd being killed due to abort or seg. fault
(unknown why), or infinite loop traversing the circular slapd_rq. With
the patch it has succeeded several thousand times in a row, which I
account for more than just a coincident.
So, as long as this patch doesn't create any deadlock I can't really see
the problem with it and will stick to it in our version. I find it a bit
easier to understand and verify code that requiers locking if the locks
are always used, then I don't need to know the thread states that are
allowed to call the functions as well to know whether a lock is needed or
not..
Rein
15 years, 5 months
Re: (ITS#5444) slapd seg. fault using mulitmirrored mode
by ando@sys-net.it
ando(a)sys-net.it wrote:
> mohel(a)web.de wrote:
>
>> The above log was generated using the test050 script just adding a line which
>> enters information also to a second server. Perhaps this test would also make
>> sense for further releases.
>
> Can you please post the modification to the script?
Never mind, I could reproduce it. It seems to be a duplicate of
ITS#5437. What actually happens is that if modifications are applied to
all three servers, those applied to consumers do not persist, nor get
replicated; on the contrary, the consumers crash later on as described
in ITS#5437.
p.
Ing. Pierangelo Masarati
OpenLDAP Core Team
SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
---------------------------------------
Office: +39 02 23998309
Mobile: +39 333 4963172
Email: pierangelo.masarati(a)sys-net.it
---------------------------------------
15 years, 5 months
Re: (ITS#5445) syncprov double-free bugfix.
by ando@sys-net.it
rein(a)basefarm.no wrote:
> The patch at the end adds missing parenthesis around a negated flags bit test in
> syncprov.c. Without them the test always fails, the entry is never duplicated
> and a double-free occur when the a_nvals is free'ed in the next statement if the
> same entry is sent to more than one recipient simultaneously.
Applied. Please test.
Thanks, p.
Ing. Pierangelo Masarati
OpenLDAP Core Team
SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
---------------------------------------
Office: +39 02 23998309
Mobile: +39 333 4963172
Email: pierangelo.masarati(a)sys-net.it
---------------------------------------
15 years, 5 months
Re: (ITS#5440) back-meta opens too many connections in case of error
by ando@sys-net.it
ali.pouya(a)free.fr wrote:
> I precise that The behaviour is not the same if the client binds with rootdn.
I've committed a fix that removes dangling connections as the rootdn; I
have been unable to reproduce the issue with regular users. Please
provide more useful info in order to track the issue (slapd.conf, LDIF
and requests).
p.
Ing. Pierangelo Masarati
OpenLDAP Core Team
SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
---------------------------------------
Office: +39 02 23998309
Mobile: +39 333 4963172
Email: pierangelo.masarati(a)sys-net.it
---------------------------------------
15 years, 5 months