Here it is again. HEAD code checked out this morning.
I'm able to reproduce the crash fairly often.
testrun directory uploaded at
ftp://ftp.sys-net.it/luca_scamoni_its5470_20080430.tgz
the core file is in testrun/con
Howard Chu wrote:
>> Looking at the log snippet above, I can see in the "syncprov_sendresp"
>> lines
>> that the cookie appears to be empty. This does appear to be similar to
>> ITS#5432,
>
> I don't see that in the snippet above. Seems unrelated.
Okay. I was just querying whether it was related to the fact that the
cookie= field appeared blank similar to the ITS#5432.
> "info threads" would probably be a useful start. And "bt full".
No problem. I've created a file containing the same snippet from the log
file and the extra information requested above, and uploaded it to
http://pastebin.siriusit.co.uk/openldap-its5486-detail.txt to help
preserve formatting.
Many thanks,
Mark.
--
Mark Cave-Ayland
Sirius Corporation - The Open Source Experts
http://www.siriusit.co.uk
T: +44 870 608 0063
mark.cave-ayland(a)siriusit.co.uk wrote:
> Full_Name: Mark Cave-Ayland
> Version: 2.4.8cvs-RE24-2008-04-15
> OS: RHEL4, x86
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (217.207.197.142)
>
>
> Hi there,
>
> In order to resolve issues experienced with syncrepl/glue on an existing
> openldap-2.4.8 deployment (ITS#5430), we have been using a CVS checkout of
> openldap RE24 branch taken from 2008-04-15 on one of our test systems.
>
> Unfortunately, we are still seeing random segfaults occurring roughly once a day
> which appear to point towards the syncprov overlay once again. At the moment, we
> are having difficulty reproducing the fault under test conditions, but if
> openldap is left running long enough then it is possible to obtain a core dump.
>
> The issue is occurring with a server, pelican, which is configured using the
> syncprov overlay to a number of subordinates for different parts of the tree.
> The relevant log snippet follows:
>
> Apr 28 12:18:32 pelican slapd[7688]: do_syncrep2:
> cookie=rid=142,csn=20080428111855.697316Z#000000#000#000000
> Apr 28 12:18:32 pelican slapd[7688]: syncrepl_entry: rid=142
> LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY)
> Apr 28 12:18:32 pelican slapd[7688]: syncrepl_entry: rid=142 be_search (0)
> Apr 28 12:18:32 pelican slapd[7688]: syncrepl_entry: rid=142
> uid=richf,ou=V,ou=W,ou=X,dc=Y,dc=Z
> Apr 28 12:18:32 pelican slapd[7688]: slap_queue_csn: queing 0x9ff7560
> 20080428111855.697316Z#000000#000#000000
> Apr 28 12:18:32 pelican slapd[7688]: syncprov_sendresp:
> cookie=rid=146,csn=20080428111855.697316Z#000000#000#000000
> Apr 28 12:18:32 pelican slapd[7688]: syncprov_sendresp:
> cookie=rid=134,csn=20080428111855.697316Z#000000#000#000000
> Apr 28 12:18:32 pelican slapd[7688]: slap_graduate_commit_csn: removing
> 0xa12ee70 20080428111855.697316Z#000000#000#000000
> Apr 28 12:18:32 pelican slapd[7688]: syncrepl_entry: rid=142 be_modify (0)
> Apr 28 12:18:32 pelican slapd[7688]: slap_queue_csn: queing 0x9ff7560
> 20080428111855.697316Z#000000#000#000000
> Looking at the log snippet above, I can see in the "syncprov_sendresp" lines
> that the cookie appears to be empty. This does appear to be similar to ITS#5432,
I don't see that in the snippet above. Seems unrelated.
> although this claims to have been fixed by a commit on the 21st March (and hence
> the fix would be included within our CVS checkout). Further information can be
> provided on request.
"info threads" would probably be a useful start. And "bt full".
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
ando(a)sys-net.it wrote:
> raphael.ouazana(a)linagora.com wrote:
>> Hi,
>>
>> Le Lun 21 avril 2008 19:44, quanah(a)zimbra.com a écrit :
>>> I updated RE24 this morning with a couple of fixes from Howard.
>>>
>>> See if you can still repo, thanks.
>> It still seems to fail, after more than 100 runs:
>
> It appears from your logs that the entry
>
> "cn=ITD Staff,ou=Groups,dc=example,dc=com"
>
> is being added first, and then removed from server 3, probably related
> to some issue in sync replication.
Seems like the consumer has received a cookie from one provider, while
receiving a newer entry from the other provider. During the nonpresent search
the new entry is found and removed. So we need to restrict the filter in the
nonpresent search to only return entries older than the cookie context.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Full_Name: Mark Cave-Ayland
Version: 2.4.8cvs-RE24-2008-04-15
OS: RHEL4, x86
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (217.207.197.142)
Hi there,
In order to resolve issues experienced with syncrepl/glue on an existing
openldap-2.4.8 deployment (ITS#5430), we have been using a CVS checkout of
openldap RE24 branch taken from 2008-04-15 on one of our test systems.
Unfortunately, we are still seeing random segfaults occurring roughly once a day
which appear to point towards the syncprov overlay once again. At the moment, we
are having difficulty reproducing the fault under test conditions, but if
openldap is left running long enough then it is possible to obtain a core dump.
The issue is occurring with a server, pelican, which is configured using the
syncprov overlay to a number of subordinates for different parts of the tree.
The relevant log snippet follows:
Apr 28 12:18:32 pelican slapd[7688]: do_syncrep2:
cookie=rid=142,csn=20080428111855.697316Z#000000#000#000000
Apr 28 12:18:32 pelican slapd[7688]: syncrepl_entry: rid=142
LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY)
Apr 28 12:18:32 pelican slapd[7688]: syncrepl_entry: rid=142 be_search (0)
Apr 28 12:18:32 pelican slapd[7688]: syncrepl_entry: rid=142
uid=richf,ou=V,ou=W,ou=X,dc=Y,dc=Z
Apr 28 12:18:32 pelican slapd[7688]: slap_queue_csn: queing 0x9ff7560
20080428111855.697316Z#000000#000#000000
Apr 28 12:18:32 pelican slapd[7688]: syncprov_sendresp:
cookie=rid=146,csn=20080428111855.697316Z#000000#000#000000
Apr 28 12:18:32 pelican slapd[7688]: syncprov_sendresp:
cookie=rid=134,csn=20080428111855.697316Z#000000#000#000000
Apr 28 12:18:32 pelican slapd[7688]: slap_graduate_commit_csn: removing
0xa12ee70 20080428111855.697316Z#000000#000#000000
Apr 28 12:18:32 pelican slapd[7688]: syncrepl_entry: rid=142 be_modify (0)
Apr 28 12:18:32 pelican slapd[7688]: slap_queue_csn: queing 0x9ff7560
20080428111855.697316Z#000000#000#000000
The backtrace obtained from the core file looks like this:
Loaded symbols for /usr/lib/sasl2/libdigestmd5.so.2
Reading symbols from /usr/lib/openldap/syncprov-2.4.so.2...Reading symbols from
/usr/lib/debug/usr/lib/openldap/syncprov-2.4.so.2.0.4.debug...done.
done.
Loaded symbols for /usr/lib/openldap/syncprov-2.4.so.2
#0 0x080e6638 in overlay_entry_get_ov (op=0x7e3eefd0, dn=0x7e3eeeb0, oc=0x0,
ad=0x0, rw=0, e=0x7e3eedfc, on=0x808bdf8) at
../../../servers/slapd/backover.c:355
355 rc = on->on_bi.bi_entry_get_rw( op, dn,
(gdb) bt
#0 0x080e6638 in overlay_entry_get_ov (op=0x7e3eefd0, dn=0x7e3eeeb0, oc=0x0,
ad=0x0, rw=0, e=0x7e3eedfc, on=0x808bdf8) at
../../../servers/slapd/backover.c:355
#1 0x00b187ac in syncprov_qtask (ctx=0x7e3ef2a0, arg=0xa02f708) at
../../../../servers/slapd/overlays/syncprov.c:871
#2 0x0817a277 in ldap_int_thread_pool_wrapper (xpool=0x9db94d0) at
../../../libraries/libldap_r/tpool.c:663
#3 0x00acb371 in start_thread () from /lib/tls/libpthread.so.0
#4 0x00944ffe in clone () from /lib/tls/libc.so.6
(gdb)
The server pelican is configured using both the syncprov & glue overlays, while
the subordinate for ou=V,ou=W,ou=X,dc=Y,dc=Z is a simple syncrepl declaration of
type refreshAndPersist.
Looking at the log snippet above, I can see in the "syncprov_sendresp" lines
that the cookie appears to be empty. This does appear to be similar to ITS#5432,
although this claims to have been fixed by a commit on the 21st March (and hence
the fix would be included within our CVS checkout). Further information can be
provided on request.
Many thanks,
Mark.
rein(a)tollevik.no wrote:
> On Wed, 9 Apr 2008, h.b.furuseth(a)usit.uio.no wrote:
>
>> Does this help? From my fiddling with ITS#5340 (REP_ENTRY_MODIFIABLE).
>> I do not understand syncprov's handling of REP_ENTRY_MUSTRELEASE though.
>> (For one thing it seems to assume that REP_ENTRY_MUSTRELEASE is set if
>> and only if rs.sr_entry->e_private != NULL. Which is possibly true with
>> back-bdb but seems a shaky assumption in general.)
>
> Almost, except that it triggers an abort.. Using be_entry_release_r() was
> not correct, as it ended up calling entry_free(). It works if I changes
> it to use overlay_entry_release_ov() instead, as in the alternative patch
> at the end.
>
> I have put a test script that shows this deadlock on:
>
> ftp://ftp.openldap.org/incoming/test053-syncprov-glue
>
> It deadlocks without this patch, it and the rest of the test suite succeed
> with it :-) I haven't tested it in production though..
This last patch is now committed to HEAD. I am ignoring the first patch; it is
unsafe because it unlocks the si_csn_rwlock too soon. (si_ctxcsn is being used
directly in the modify request and must be held consistent until that completes.)
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
rein(a)basefarm.no wrote:
> Full_Name: Rein Tollevik
> Version: CVS head
> OS:
> URL:
> Submission from: (NULL) (81.93.160.250)
>
>
> We have seen occasional seg. faults in syncprov_qtask() where it was
> passed a syncops pointer containing garbage in its arg. It looks as
> this could happen if syncprov_free_syncop is called to free an abandoned
> operation. I hope the patch at the should fix this, it makes sure to
> remove the syncops->s_qtask (if any) from the runqueue before freeing
> the syncops itself.
I think this could also cause ITS#5452.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Pierangelo Masarati <ando(a)sys-net.it> wrote:
> > While I was here, I added myself to the aknowledgements in the man page.
> Applied to HEAD (with minor changes); please test.
A quick test suggests it works as intended. Any chance to have it in an
upcoming 2.4 release?
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu(a)netbsd.org
Emmanuel Dreyfus wrote:
> Sorry. The new patch was uploaded in FTP:
> manu-20080426.patch
>
> While I was here, I added myself to the aknowledgements in the man page.
Applied to HEAD (with minor changes); please test.
p.
Ing. Pierangelo Masarati
OpenLDAP Core Team
SysNet s.r.l.
via Dossi, 8 - 27100 Pavia - ITALIA
http://www.sys-net.it
---------------------------------------
Office: +39 02 23998309
Mobile: +39 333 4963172
Email: pierangelo.masarati(a)sys-net.it
---------------------------------------
Pierangelo Masarati <ando(a)sys-net.it> wrote:
> I'm considering your patch for inclusion in HEAD code; I see that in
> your patch you didn't follow the IPR notice as indicated in the
> contributing guidelines. Please resubmit the patch adding at the top
> the notices illustrated in
> <http://www.openldap.org/devel/contributing.html#notice>:
>
> - notice of origin
> - rights statement
Sorry. The new patch was uploaded in FTP:
manu-20080426.patch
While I was here, I added myself to the aknowledgements in the man page.
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu(a)netbsd.org