openldap-bugs March 2008

openldap-bugs@openldap.org

34 participants
162 discussions

Re: (ITS#5444) slapd seg. fault using mulitmirrored mode
by ando＠sys-net.it 31 Mar '08

31 Mar '08

hyc(a)symas.com wrote: > ando(a)sys-net.it wrote: >> ando(a)sys-net.it wrote: >>> mohel(a)web.de wrote: >>> >>>> The above log was generated using the test050 script just adding a line which >>>> enters information also to a second server. Perhaps this test would also make >>>> sense for further releases. >>> Can you please post the modification to the script? >> Never mind, I could reproduce it. It seems to be a duplicate of >> ITS#5437. What actually happens is that if modifications are applied to >> all three servers, those applied to consumers do not persist, nor get >> replicated; on the contrary, the consumers crash later on as described >> in ITS#5437. > > #5437? What you're talking about doesn't seem to match any text in ITS#5437. Well, sorry for not providing more details, I was going to investigate it a little bit more but I had to leave. The crash to me occurs in syncprov_done_ctrl(), as indicated in ITS#5437. What happens is that syncprov_done_ctrl() with an invalid cookie (unset; clearing it in syncprov_search_response(), as suggested in ITS#5437 leads to the behavior I described (writes to consumers are simply ignored, don't persist in the consumers and are not replicated). To reproduce, add something to a consumer close to the end of test050, before killing the servers, and wait a little bit to let things proceed to the crash. p. Ing. Pierangelo Masarati OpenLDAP Core Team SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati(a)sys-net.it ---------------------------------------

1 0

Re: (ITS#5444) slapd seg. fault using mulitmirrored mode
by hyc＠symas.com 31 Mar '08

31 Mar '08

ando(a)sys-net.it wrote: > ando(a)sys-net.it wrote: >> mohel(a)web.de wrote: >> >>> The above log was generated using the test050 script just adding a line which >>> enters information also to a second server. Perhaps this test would also make >>> sense for further releases. >> Can you please post the modification to the script? > > Never mind, I could reproduce it. It seems to be a duplicate of > ITS#5437. What actually happens is that if modifications are applied to > all three servers, those applied to consumers do not persist, nor get > replicated; on the contrary, the consumers crash later on as described > in ITS#5437. #5437? What you're talking about doesn't seem to match any text in ITS#5437. -- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

1 0

Re: (ITS#5442) slapd_rq not locked before use bugfix
by hyc＠symas.com 31 Mar '08

31 Mar '08

Rein Tollevik wrote: > On Sun, 30 Mar 2008, hyc(a)symas.com wrote: > >> rein(a)tollevik.no wrote: >>> On Sat, 29 Mar 2008, ando(a)sys-net.it wrote: >>>> rein(a)basefarm.no wrote: > >>>>> I was seeing random failures of the test050-syncrepl-multimaster test. One of >>>>> the failures was that it went into a tight loop traversing a circular runqueue >>>>> it had managed to create in slapd_rq.task_list. It seems as this was caused by >>>>> missing mutex locks around accesses to slapd_rq, which the patch uploaded to >>>>> ftp://ftp.openldap.org/incoming/slapd_rq_lock.patch fixes. >>>>> >>>>> Before I applied this patch the test failed after being run a few times, with it >>>>> it has now passed 100 times and is still counting. >>>> locks in back-bdb/config.c should be pointless, as modifications to the >>>> configuration should only occur while all threads are paused. The rest >>>> makes sort of sense, but I'd leave it to Howard. >> Ignoring the ITS#5403 changes, I don't see anything here that isn't >> config-related, therefore it's all running single-threaded. > > Now that the configuration can be changed dynamically (as this test does) > I find it a bit odd that the config stuff should always be running > single-threaded. But there is obviously much I don't know about the > internals of slapd. This was discussed at the 2003 OpenLDAP Developers' Day. http://www.openldap.org/conf/odd-wien-2003/proceedings.html Page 7 of my slides touched on it. It's also mentioned again here http://www.openldap.org/lists/openldap-devel/200505/msg00062.html The main point is that the original config stuff assumed that it was only executing during slapd startup, and single-threaded. Putting locks around all of the potential configurable value accesses would have been too much work, so the decision was made to force slapd to be single-threaded whenever writing to the config. I'm guessing that what's happening here is just a broken assumption - suspending the thread pool doesn't in fact freeze everything in slapd. In particular, the slapd listener thread may still wake up for a select() timeout to handle the slapd runqueue. Even though no tasks submitted as a result of this will execute (because they'll just get queued into the suspended thread pool) the runqueue itself is still being manipulated. Given that explanation, your patches make sense. -- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

1 0

Re: (ITS#5439) back-bdb race condition
by quanah＠zimbra.com 31 Mar '08

31 Mar '08

--On Monday, March 31, 2008 6:27 PM +0000 rein(a)basefarm.no wrote: > We did see segmentation faults both Friday and Monday (not very much > going on during the weekend..), but so far those appear to be related to > ITS#5445. A slapd and db_checkpoint (run out of cron) deadlock which I > hope is due to the BerkeleyDB bug mentioned in ITS#5391 has also > occurred, so I'll upgrade from 4.6.18 to 4.6.21.1 to see if that problem > goes away. Why do you run db_checkpoint out of cron? Slapd will already take care of checkpointing the database for you using the "checkpoint" directive... --Quanah -- Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration

1 0

Re: (ITS#5439) back-bdb race condition
by rein＠basefarm.no 31 Mar '08

31 Mar '08

On Tue, 25 Mar 2008, Howard Chu wrote: > Rein Tollevik wrote: >> On Tue, 25 Mar 2008, hyc(a)symas.com wrote: >> It now looks to me as if the entire rs->sr_entry was released an reused, >> and that the bug probably is in back-bdb. It just always happened when >> syncprov was running as this is a master server with mostly writes >> except from the reads syncprov does. Which also means that the title on >> the bug report is probably misleading :-( >> >> The reason for my shift of focus is some highly suspicious variables >> found in the bdb_search() frame. > > Look at the value of idflag in bdb_search. (see the comment around 672) > > Could be a locking bug with ID_NOCACHE in cache.c... Yes, it looks very much as this was the problem. We have been running with the patch at the end since thursday afternoon, and so far it seem to have worked :-) Although it is a bit too early to conclude, especially as we have seen other problems as well. We did see segmentation faults both friday and monday (not very much going on during the weekend..), but so far those appear to be related to ITS#5445. A slapd and db_checkpoint (run out of cron) deadlock which I hope is due to the BerkeleyDB bug mentioned in ITS#5391 has also occurred, so I'll upgrade from 4.6.18 to 4.6.21.1 to see if that problem goes away. Now, back to this bug. There is a potential bug in bdb_cache_find_id() as it inspects and modifies the EntryInfo without it being locked. Although I don't think this was the cause here, as bdb_search() always passes a NULL eip which mean that bdb_cache_find_ndn() should have acquired a lock. The patch always locks the entryinfo though, as I believe that to be necessary for the rest of the patch to be correct. The real problem was in bdb_cache_return_entry_rw() which released ei->bei_e without making sure it was the only thread using it. I.e it should have had a write lock on the entry_db (as bdb_cache_lru_purge() makes sure to have), but instead it freed it after releasing the read lock. Instead of write locking the entry before releasing it my alternative solution is to only let the thread that loads the entry set the CACHE_ENTRY_NOT_CACHED flag, and only if no other threads found the entryinfo while it loaded the entry. All other threads clears the flag when they find they don't need to load the entry. I also use trylock to lock the entryinfo before setting or inspecting the cache flag, for two reasons. First, honoring this flag should not be important enough to wait for a lock (after all, it should be cleared anyhow if anyone else is interested in the same entry). Second (and most important), to avoid any deadlock situations, which my first attempt to fix this bug managed to create when I moved the release of the entryinfo lock to after the entry_db lock had been acquired in bdb_cache_find_id(). Rein Index: OpenLDAP/servers/slapd/back-bdb/cache.c diff -u OpenLDAP/servers/slapd/back-bdb/cache.c:1.1.1.15 OpenLDAP/servers/slapd/back-bdb/cache.c:1.2 --- OpenLDAP/servers/slapd/back-bdb/cache.c:1.1.1.15 Sat Mar 22 16:47:56 2008 +++ OpenLDAP/servers/slapd/back-bdb/cache.c Mon Mar 31 17:52:17 2008 @@ -252,16 +252,27 @@ int free = 0; ei = e->e_private; - bdb_cache_entry_db_unlock( bdb, lock ); - if ( ei ) { - bdb_cache_entryinfo_lock( ei ); + if ( ei && + ( ei->bei_state & CACHE_ENTRY_NOT_CACHED ) && + ( bdb_cache_entryinfo_trylock( ei ) == 0 ) ) { if ( ei->bei_state & CACHE_ENTRY_NOT_CACHED ) { + /* Releasing the entry can only be done when + * we know that nobody else is using it, i.e we + * should have an entry_db writelock. But the + * flag is only set by the thread that loads the + * entry, and only if no other threads has found + * it while it was working. All other threads + * clear the flag, which mean that we should be + * the only thread using the entry if the flag + * is set here. + */ ei->bei_e = NULL; ei->bei_state ^= CACHE_ENTRY_NOT_CACHED; free = 1; } bdb_cache_entryinfo_unlock( ei ); } + bdb_cache_entry_db_unlock( bdb, lock ); if ( free ) { e->e_private = NULL; bdb_entry_return( e ); @@ -854,6 +865,11 @@ /* Ok, we found the info, do we have the entry? */ if ( rc == 0 ) { + if ( !( flag & ID_LOCKED )) { + bdb_cache_entryinfo_lock ( *eip ); + flag |= ID_LOCKED; + } + if ( (*eip)->bei_state & CACHE_ENTRY_DELETED ) { rc = DB_NOTFOUND; } else { @@ -873,13 +889,13 @@ (*eip)->bei_state |= CACHE_ENTRY_LOADING; } - /* If the entry was loaded before but uncached, and we need - * it again, clear the uncached state - */ - if ( (*eip)->bei_state & CACHE_ENTRY_NOT_CACHED ) { - (*eip)->bei_state ^= CACHE_ENTRY_NOT_CACHED; - if ( flag & ID_NOCACHE ) - flag ^= ID_NOCACHE; + if ( !load ) { + /* Clear the uncached state if we are not + * loading it, i.e it is already cached or + * another thread is currently loading it. + */ + (*eip)->bei_state &= ~CACHE_ENTRY_NOT_CACHED; + flag &= ~ID_NOCACHE; } if ( flag & ID_LOCKED ) { @@ -906,10 +922,14 @@ #endif ep = NULL; bdb_cache_lru_link( bdb, *eip ); - if ( flag & ID_NOCACHE ) { - bdb_cache_entryinfo_lock( *eip ); - (*eip)->bei_state |= CACHE_ENTRY_NOT_CACHED; - bdb_cache_entryinfo_unlock( *eip ); + if ( ( flag & ID_NOCACHE ) && + ( bdb_cache_entryinfo_trylock ( *eip ) == 0 ) ) { + /* Set the cached state only if no other thread + * found the info while we was loading the entry. + */ + if ( (*eip)->bei_finders == 1 ) + (*eip)->bei_state |= CACHE_ENTRY_NOT_CACHED; + bdb_cache_entryinfo_unlock ( *eip ); } } if ( rc == 0 ) {

1 0

Re: (ITS#5444) slapd seg. fault using mulitmirrored mode
by mohelonline＠googlemail.com 31 Mar '08

31 Mar '08

------=_Part_28055_26658090.1206987058866 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline Thanks for the fast reply. I just added a few lines after the 20 seconds wait for the master replcaition: ---------------------------------- echo "Using ldapadd to populate consumer ..." $LDAPADD -D "$MANAGERDN" -H $URI2 -w $PASSWD -f $LDIFORDERED2 \ >> $TESTOUT 2>&1 SLEEP=20 echo "Waiting $SLEEP seconds for syncrepl to receive changes..." sleep $SLEEP ---------------------------------- These lines just add one additional ldapuser. After an additional 20seconds pause I check if the changes (form server with URI2) are replicated. But always the slapd prcoess dies after the ldapadd command. I'll give your above suggestion a try and give feedback afterwards. On Mon, Mar 31, 2008 at 6:04 PM, Pierangelo Masarati <ando(a)sys-net.it> wrote: > ando(a)sys-net.it wrote: > > mohel(a)web.de wrote: > > > >> The above log was generated using the test050 script just adding a line > which > >> enters information also to a second server. Perhaps this test would > also make > >> sense for further releases. > > > > Can you please post the modification to the script? > > Never mind, I could reproduce it. It seems to be a duplicate of > ITS#5437. What actually happens is that if modifications are applied to > all three servers, those applied to consumers do not persist, nor get > replicated; on the contrary, the consumers crash later on as described > in ITS#5437. > > p. > > > > Ing. Pierangelo Masarati > OpenLDAP Core Team > > SysNet s.r.l. > via Dossi, 8 - 27100 Pavia - ITALIA > http://www.sys-net.it > --------------------------------------- > Office: +39 02 23998309 > Mobile: +39 333 4963172 > Email: pierangelo.masarati(a)sys-net.it > --------------------------------------- > > > ------=_Part_28055_26658090.1206987058866 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline Thanks for the fast reply.<br><br>I just added a few lines after the 20 seconds wait for the master replcaition:<br>------------------------------<div id="1eob" class="ArwC7c ckChnd">----<br>echo "Using ldapadd to populate consumer ..."<br> $LDAPADD -D "$MANAGERDN" -H $URI2 -w $PASSWD -f $LDIFORDERED2 \<br>         >> $TESTOUT 2>&1<br><br>SLEEP=20<br>echo "Waiting $SLEEP seconds for syncrepl to receive changes..."<br>sleep $SLEEP<br> ----------------------------------<br><br>These lines just add one additional ldapuser. After an additional 20seconds pause I check if the changes (form server with URI2) are replicated.<br><br>But always the slapd prcoess dies after the ldapadd command.<br> <br>I'll give your above suggestion a try and give feedback afterwards.</div><br><br><div class="gmail_quote">On Mon, Mar 31, 2008 at 6:04 PM, Pierangelo Masarati <<a href="mailto:ando@sys-net.it">ando(a)sys-net.it</a>> wrote:<br> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d"><a href="mailto:ando@sys-net.it">ando(a)sys-net.it</a> wrote:<br> > <a href="mailto:mohel@web.de">mohel(a)web.de</a> wrote:<br> ><br> >> The above log was generated using the test050 script just adding a line which<br> >> enters information also to a second server. Perhaps this test would also make<br> >> sense for further releases.<br> ><br> > Can you please post the modification to the script?<br> <br> </div>Never mind, I could reproduce it.  It seems to be a duplicate of<br> ITS#5437.  What actually happens is that if modifications are applied to<br> all three servers, those applied to consumers do not persist, nor get<br> replicated; on the contrary, the consumers crash later on as described<br> in ITS#5437.<br> <div><div></div><div class="Wj3C7c"><br> p.<br> <br> <br> <br> Ing. Pierangelo Masarati<br> OpenLDAP Core Team<br> <br> SysNet s.r.l.<br> via Dossi, 8 - 27100 Pavia - ITALIA<br> <a href="http://www.sys-net.it" target="_blank">http://www.sys-net.it</a><br> ---------------------------------------<br> Office:  +39 02 23998309<br> Mobile:  +39 333 4963172<br> Email:   <a href="mailto:pierangelo.masarati@sys-net.it">pierangelo.masarati(a)sys-net.it</a><br> ---------------------------------------<br> <br> <br> </div></div></blockquote></div><br> ------=_Part_28055_26658090.1206987058866--

1 0

Re: (ITS#5442) slapd_rq not locked before use bugfix
by rein＠basefarm.no 31 Mar '08

31 Mar '08

On Sun, 30 Mar 2008, hyc(a)symas.com wrote: > rein(a)tollevik.no wrote: >> On Sat, 29 Mar 2008, ando(a)sys-net.it wrote: >>> rein(a)basefarm.no wrote: >>>> I was seeing random failures of the test050-syncrepl-multimaster test. One of >>>> the failures was that it went into a tight loop traversing a circular runqueue >>>> it had managed to create in slapd_rq.task_list. It seems as this was caused by >>>> missing mutex locks around accesses to slapd_rq, which the patch uploaded to >>>> ftp://ftp.openldap.org/incoming/slapd_rq_lock.patch fixes. >>>> >>>> Before I applied this patch the test failed after being run a few times, with it >>>> it has now passed 100 times and is still counting. >>> locks in back-bdb/config.c should be pointless, as modifications to the >>> configuration should only occur while all threads are paused. The rest >>> makes sort of sense, but I'd leave it to Howard. > > Ignoring the ITS#5403 changes, I don't see anything here that isn't > config-related, therefore it's all running single-threaded. Now that the configuration can be changed dynamically (as this test does) I find it a bit odd that the config stuff should always be running single-threaded. But there is obviously much I don't know about the internals of slapd. > Of the "relevant" changes in syncrepl.c, I note that three out of the four > chunks of the patch are in code that is only run when using cn=config to > delete a syncrepl configuration, and test050 never performs that operation. > The remaining chunk only takes affect when adding syncrepl config, and again, > slapd is single-threaded for that. I have been experimenting a bit more with the patch, and it appears to be the locks added to add_syncrepl() that stopped the test from failing. The test uses ordinary ldapadd to add the syncrepl configuration after slapd has been started, which I would expect to indicate that it has entered the multi-threaded state. But again, it can revert to single-threaded when modifying the configuration for all I know. > I've also run test050 thru hundreds of iterations without any issue, without > these patches. If there's a problem in test050, I don't believe it's in this code. I have also run it several hundred times on other systems, it is only on one of the 4 systems I build slapd that it fails. That is, that test failed on one of the other systems with the 2.4.8 release, also with an infinite loop, but that dosn't say vary much in this context. The system where the test fail is the only multi-cpu system I have run it on, which I recon to be the reason why it only fails there. Without this patch the test fails about 20% of the times it is run on that system, the reason varies between slapd being killed due to abort or seg. fault (unknown why), or infinite loop traversing the circular slapd_rq. With the patch it has succeeded several thousand times in a row, which I account for more than just a coincident. So, as long as this patch doesn't create any deadlock I can't really see the problem with it and will stick to it in our version. I find it a bit easier to understand and verify code that requiers locking if the locks are always used, then I don't need to know the thread states that are allowed to call the functions as well to know whether a lock is needed or not.. Rein

1 0

Re: (ITS#5444) slapd seg. fault using mulitmirrored mode
by ando＠sys-net.it 31 Mar '08

31 Mar '08

ando(a)sys-net.it wrote: > mohel(a)web.de wrote: > >> The above log was generated using the test050 script just adding a line which >> enters information also to a second server. Perhaps this test would also make >> sense for further releases. > > Can you please post the modification to the script? Never mind, I could reproduce it. It seems to be a duplicate of ITS#5437. What actually happens is that if modifications are applied to all three servers, those applied to consumers do not persist, nor get replicated; on the contrary, the consumers crash later on as described in ITS#5437. p. Ing. Pierangelo Masarati OpenLDAP Core Team SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati(a)sys-net.it ---------------------------------------

1 0

Re: (ITS#5445) syncprov double-free bugfix.
by ando＠sys-net.it 31 Mar '08

31 Mar '08

rein(a)basefarm.no wrote: > The patch at the end adds missing parenthesis around a negated flags bit test in > syncprov.c. Without them the test always fails, the entry is never duplicated > and a double-free occur when the a_nvals is free'ed in the next statement if the > same entry is sent to more than one recipient simultaneously. Applied. Please test. Thanks, p. Ing. Pierangelo Masarati OpenLDAP Core Team SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati(a)sys-net.it ---------------------------------------

1 0

Re: (ITS#5440) back-meta opens too many connections in case of error
by ando＠sys-net.it 31 Mar '08

31 Mar '08

ali.pouya(a)free.fr wrote: > I precise that The behaviour is not the same if the client binds with rootdn. I've committed a fix that removes dangling connections as the rootdn; I have been unable to reproduce the issue with regular users. Please provide more useful info in order to track the issue (slapd.conf, LDIF and requests). p. Ing. Pierangelo Masarati OpenLDAP Core Team SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati(a)sys-net.it ---------------------------------------

1 0

Jump to page:

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

openldap-bugs March 2008