Re: (ITS#6158) syncprov: assert causing slapd to core dump
by jonathan@phillipoux.net
On 02.06.2009 18:23, Howard Chu wrote:
> jonathan(a)phillipoux.net wrote:
>> OK, I've spent some more time trying to understand this part of
>> syncprov.c. From what I understand:
>>
>> - the assert failure in ldap_pvt_runqueue_resched is caused by the fact
>> syncprov_qstart is trying to "reschedule" a task that is no longer in
>> the task_list
>> - the only time the task is removed from the task_list (via
>> ldap_pvt_runqueue_remove) is when the task is being run, in
>> syncprov_qtask, if syncprov_qplay returns !=0
>> - the next time syncprov_qstart is called, it finds "so->s_qtask" is not
>> NULL, and tries to reschedule the task, but it's no longer in the
>> task_list.
>>
>> I've written a patch that sets "so->s_qtask" to NULL in syncprov_qtask,
>> just after removing the task from the task_list. So that when
>> syncprov_qstart is called again, it goes into
>> ldap_pvt_runqueue_insert... The patch is attached.
>>
>> Unfortunately, I can't confirm it fixes the bug since I can't reproduce
>> it... For those who understand the logic behind this, does this make any
>> sense? :)
>
> Ah, you want rev 1.249 of syncprov.c. Closing this as a dup of ITS#5776.
Indeed, that's great. Thanks a lot!
> Of course, all of this code has been removed from RE24 as of 1.265.
Will this patch make it into RE23 for a possible maintenance release of 2.3?
Regards,
Jonathan
--
--------------------------------------------------------------
Jonathan Clarke - jonathan(a)phillipoux.net
--------------------------------------------------------------
Ldap Synchronization Connector (LSC) - http://lsc-project.org
--------------------------------------------------------------
Re: (ITS#6138) Bad Cancel/Abandon/"internal abandon"/Syncprov interactions
by h.b.furuseth@usit.uio.no
back-ldap:extended.c also does "suppress response, it has been sent",
but does it by returning and setting rs->sr_err = SLAPD_ABANDON. Might
break assumptions somewhere that SLAPD_ABANDON implies o_abandon was
set. And I guess the hack fails if the operation gets cancelled.
========================================================================
I think these are the Operation states related to Cancel and Abandon:
op->o_abandon is set for these - could extend to multiple values:
A) Operation Abandoned/Cancelled by client.
B) Operation implicitly abandoned by client. (Bind or lost connection)
C) Operation abandoned by server. (It wants to close the connection)
D) Suppress response - a duplicate of the operation will proceed. (syncprov)
E) Suppress response - final send_ldap_response() was done. (retcode overlay)
rs->sr_err == SLAPD_ABANDON if:
F) The backend obeyed o_abandon. (Cancel op, if any, will succeed)
G=E) Suppress response - final send_ldap_response() was done. (back-ldap)
op->o_cancel packs these states/values:
H) The o_abandon is due to a Cancel.
I) Cancel operation wants a result, cancelled op must set it and wait.
J) Result is available to the Cancel operation.
K) Result. (LDAP result code, or SLAP_CANCEL_ACK for success)
L) Cancel operation has fetched result, cancelled operation can proceed.
States that fit in none of the above, or poorly so:
M) Operation must not be waited for, e.g. by Cancel.
Operation is itself waiting for others, e.g. cn=config update.
N) Operation invisible to Abandon/Cancel/internal abandon.
msgID reusable due to result sent to client. Also case D (syncprov)?
Fix by removing the op from op->o_conn->c_ops? Or does that just
move the problem around? Would need to do something to o_conn to
prevent connection_close() from doing connection_destroy().
O) Operation result has been committed, do not abandon. ITS#6059.
But o_abandon can be set while trying to commit, unless this flag is
set before trying - in which case we can't abandon an operation which
is failing to commit, which may be when it's most relevant.
Could reset o_abandon, if anyone can keep straight the consequences.
Or replace the 'if ( op->o_abandon )' tests with some macro call.
Still, interactions with other states could be a problem.
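The multi-valued o_abandon suggested above could be sketched roughly as
follows. This is only an illustration: the SLAP_ABANDON_* names and the
helper are invented here, not OpenLDAP API (slapd currently keeps
o_abandon as a single flag):

```c
#include <assert.h>

/* Hypothetical multi-valued replacement for the o_abandon flag,
 * mirroring cases A-E above.  Names are invented for illustration. */
typedef enum {
	SLAP_ABANDON_NONE = 0,
	SLAP_ABANDON_CLIENT,		/* A) Abandoned/Cancelled by client */
	SLAP_ABANDON_IMPLICIT,		/* B) Bind or lost connection */
	SLAP_ABANDON_SERVER,		/* C) server closing the connection */
	SLAP_ABANDON_SUPPRESS_DUP,	/* D) duplicate op proceeds (syncprov) */
	SLAP_ABANDON_SUPPRESS_SENT	/* E) final response already sent */
} slap_abandon_t;

/* Only A-C correspond to an abandon the backends were actually told
 * about; D-E merely suppress the response even though the backends'
 * cancel/abandon handlers were never called. */
static int
abandon_reached_backend( slap_abandon_t a )
{
	return a == SLAP_ABANDON_CLIENT || a == SLAP_ABANDON_IMPLICIT
		|| a == SLAP_ABANDON_SERVER;
}
```

Code testing 'if ( op->o_abandon )' would then distinguish a real
abandon from mere response suppression instead of conflating them.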
About the o_abandon values above:
B can be treated like A, I think.
C differs in that Cancel/Abandon(operation) should not say "already
abandoned" since the client doesn't know about the abandon.
Could be solved with a vague error message.
D-E differ in that o_abandon gets set even though the backends'
cancel/abandon handlers were not called. Unsure of the effects of that.
D Syncprov duplicating a Persistent Search operation.
Handled similar to a server-initiated abandon? Except that if the
operation cannot be made "invisible to Abandon/Cancel" as above, it
must remain possible to Abandon/Cancel it.
E Suppress response - response has been sent:
Set when exiting slap_send_ldap_result() & co?
Handled similar to a server-initiated abandon?
At the time slap_send_ldap_result() is called again, the operation
may have set up things which need to be cleaned up in the normal way.
Yet it has already gone through that function once, doing callbacks
etc. Must "final response" code be prepared to be called twice?
Beyond that, the main problem would be code which transitions from one
state to another: it needs to handle the other cases.
--
Hallvard
Re: (ITS#6152) proxycache enhancements
by mhardin@symas.com
It might be less intrusive codewise and more flexible if we left the
behavior of cache expiration the same and added a parameter to each
template called "Time to Refresh" (TTR). Then you set long or
unlimited cache expirations, which are always in effect, but set a
shorter TTR that would trigger an asynchronous refresh when the TTR
expired. If the db is not available these refreshes will simply fail,
but the data will remain in the cache at least until it's expired by
the usual means.
This gives the solution designer the option of deciding how long a
system can run disconnected while still being able to separately
determine how stale the contents of the cache will get when connected.
It also means that pcache itself doesn't need to switch modes based on
whether it thinks it's connected to a db or not, and in fact may not
even need to know whether it is connected.
There is still room in this design for a flag that controls whether
pcache should behave as if it's disconnected or connected, but I'm not
sure how useful that is given the changes described above.
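The serve/refresh decision described above could be sketched like this.
The struct layout, field names, and action names are hypothetical, not
the actual pcache code; the point is just that TTR and TTL are checked
independently, with TTR only ever triggering a background refresh:

```c
#include <time.h>

/* Hypothetical cached-query record; real pcache structures differ. */
typedef struct cached_query {
	time_t	cq_cached_at;	/* when the answer was cached */
	time_t	cq_ttl;		/* hard expiration; 0 = unlimited */
	time_t	cq_ttr;		/* time-to-refresh; 0 = disabled */
} cached_query;

enum cache_action {
	CACHE_MISS,		/* expired: must query the remote db */
	CACHE_ANSWER,		/* fresh: answer from cache */
	CACHE_ANSWER_REFRESH	/* stale but valid: answer from cache and
				 * kick off an async refresh; if the db is
				 * down the refresh simply fails and the
				 * entry stays cached until its TTL */
};

static enum cache_action
cache_check( const cached_query *cq, time_t now )
{
	time_t age = now - cq->cq_cached_at;

	if ( cq->cq_ttl && age >= cq->cq_ttl )
		return CACHE_MISS;
	if ( cq->cq_ttr && age >= cq->cq_ttr )
		return CACHE_ANSWER_REFRESH;
	return CACHE_ANSWER;
}
```

With an unlimited TTL (cq_ttl == 0) the entry never becomes a miss, so
a disconnected system keeps serving stale answers indefinitely while a
connected one refreshes every TTR seconds.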
Cheers,
-Matt
Matthew Hardin
Symas Corporation - The LDAP Guys
http://www.symas.com
Re: (ITS#6158) syncprov: assert causing slapd to core dump
by hyc@symas.com
jonathan(a)phillipoux.net wrote:
> OK, I've spent some more time trying to understand this part of
> syncprov.c. From what I understand:
>
> - the assert failure in ldap_pvt_runqueue_resched is caused by the fact
> syncprov_qstart is trying to "reschedule" a task that is no longer in
> the task_list
> - the only time the task is removed from the task_list (via
> ldap_pvt_runqueue_remove) is when the task is being run, in
> syncprov_qtask, if syncprov_qplay returns !=0
> - the next time syncprov_qstart is called, it finds "so->s_qtask" is not
> NULL, and tries to reschedule the task, but it's no longer in the task_list.
>
> I've written a patch that sets "so->s_qtask" to NULL in syncprov_qtask,
> just after removing the task from the task_list. So that when
> syncprov_qstart is called again, it goes into
> ldap_pvt_runqueue_insert... The patch is attached.
>
> Unfortunately, I can't confirm it fixes the bug since I can't reproduce
> it... For those who understand the logic behind this, does this make any
> sense? :)
Ah, you want rev 1.249 of syncprov.c. Closing this as a dup of ITS#5776.
Of course, all of this code has been removed from RE24 as of 1.265.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Re: (ITS#6158) syncprov: assert causing slapd to core dump
by jonathan@phillipoux.net
On 02.06.2009 12:28, jonathan(a)phillipoux.net wrote:
> Full_Name: Jonathan Clarke
> Version: 2.3.43
> OS: Solaris
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (213.41.243.192)
>
>
> Hi,
>
> I have a 2.3.43 running on a Solaris Sparc server, which crashes occasionally -
> once every week or two, always during the night. At this particular time a large
> number of operations are performed, including mass deletes and adds. I haven't
> been able to reproduce this bug, just watch it happen on the production server
> every now and again...
>
> I managed to obtain a coredump, and a backtrace (at the end of this message). I
> realize this isn't much to go on, but I'm rather unfamiliar with this part of
> the code, so I wondered if anyone has an idea what's going on here?
>
> FWIW, the dynlist and chain overlays are in use on the server, and the database
> is bdb, with a syncrepl consumer as well as syncprov overlay.
>
>
> Backtrace follows:
> 8<-------------------------------------------------------------
> Thread 1 (process 1054014 ):
> #0 0xfee4aa58 in _lwp_kill () from /lib/libc.so.1
> #1 0xfede5a64 in raise () from /lib/libc.so.1
> #2 0xfedc1954 in abort () from /lib/libc.so.1
> #3 0xfedc1b90 in _assert () from /lib/libc.so.1
> #4 0xff30ef44 in ldap_pvt_runqueue_resched (rq=0x16c630, entry=0xee6c0a0,
> defer=0) at rq.c:165
> #5 0xfe7f4a94 in syncprov_qstart (so=0x10acb540) at syncprov.c:933
> #6 0xfe7f4d6c in syncprov_qresp (opc=0x1b1bfaf8, so=0x10acb540, mode=2) at
> syncprov.c:982
> #7 0xfe7f5aa4 in syncprov_matchops (op=0xf6bffa50, opc=0x1b1bfaf8, saveit=0) at
> syncprov.c:1175
> #8 0xfe7f7490 in syncprov_op_response (op=0xf6bffa50, rs=0xf6bff644) at
> syncprov.c:1561
> #9 0x000575cc in ?? ()
> #10 0x000575cc in ?? ()
> 8<-------------------------------------------------------------
>
> Thanks in advance for any pointers!
>
OK, I've spent some more time trying to understand this part of
syncprov.c. From what I understand:
- the assert failure in ldap_pvt_runqueue_resched is caused by the fact
syncprov_qstart is trying to "reschedule" a task that is no longer in
the task_list
- the only time the task is removed from the task_list (via
ldap_pvt_runqueue_remove) is when the task is being run, in
syncprov_qtask, if syncprov_qplay returns !=0
- the next time syncprov_qstart is called, it finds "so->s_qtask" is not
NULL, and tries to reschedule the task, but it's no longer in the task_list.
I've written a patch that sets "so->s_qtask" to NULL in syncprov_qtask,
just after removing the task from the task_list. So that when
syncprov_qstart is called again, it goes into
ldap_pvt_runqueue_insert... The patch is attached.
Unfortunately, I can't confirm it fixes the bug since I can't reproduce
it... For those who understand the logic behind this, does this make any
sense? :)
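The lifecycle the patch targets can be modeled in miniature. Everything
below is a simplified toy, not the real runqueue code, but it captures
why clearing the cached task pointer matters: resched() asserts
membership, just as ldap_pvt_runqueue_resched does.

```c
#include <assert.h>
#include <stddef.h>

/* Toy runqueue with the same membership invariant as slapd's. */
struct task { struct task *next; };
struct queue { struct task *head; };

static void rq_insert( struct queue *q, struct task *t )
{
	t->next = q->head; q->head = t;
}

static int rq_member( struct queue *q, struct task *t )
{
	struct task *p;
	for ( p = q->head; p; p = p->next )
		if ( p == t ) return 1;
	return 0;
}

static void rq_remove( struct queue *q, struct task *t )
{
	struct task **pp;
	for ( pp = &q->head; *pp; pp = &(*pp)->next )
		if ( *pp == t ) { *pp = t->next; return; }
}

static void rq_resched( struct queue *q, struct task *t )
{
	assert( rq_member( q, t ));	/* aborts on a stale handle */
}

/* qstart analogue: s_qtask caches "my task is already queued". */
static void qstart( struct queue *q, struct task **s_qtask, struct task *t )
{
	if ( *s_qtask )
		rq_resched( q, *s_qtask );
	else {
		rq_insert( q, t );
		*s_qtask = t;
	}
}
```

Without the patch, the error path does rq_remove() but leaves s_qtask
pointing at the removed task, so the next qstart() takes the resched
branch and trips the assert; with s_qtask cleared after the remove, it
takes the insert branch instead.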
Regards,
Jonathan
--
--------------------------------------------------------------
Jonathan Clarke - jonathan(a)phillipoux.net
--------------------------------------------------------------
Ldap Synchronization Connector (LSC) - http://lsc-project.org
--------------------------------------------------------------
Attachment: patch-syncprov-20090602.patch
Index: servers/slapd/overlays/syncprov.c
===================================================================
RCS file: /repo/OpenLDAP/pkg/ldap/servers/slapd/overlays/syncprov.c,v
retrieving revision 1.56.2.51
diff -u -p -r1.56.2.51 syncprov.c
--- servers/slapd/overlays/syncprov.c 9 Jul 2008 20:53:13 -0000 1.56.2.51
+++ servers/slapd/overlays/syncprov.c 2 Jun 2009 15:57:21 -0000
@@ -908,6 +908,7 @@ syncprov_qtask( void *ctx, void *arg )
} else {
/* bail out on any error */
ldap_pvt_runqueue_remove( &slapd_rq, rtask );
+ if ( so ) so->s_qtask = NULL;
}
ldap_pvt_thread_mutex_unlock( &slapd_rq.rq_mutex );
Re: (ITS#6153) Segfault during Heimdal's kadmin -l init Realm
by hyc@symas.com
dewayne_freebsd(a)yahoo.com wrote:
> Full_Name: Dewayne Geraghty
> Version: 2.4.16
> OS: FreeBSD 7.2R
> URL: http://www.consciuminternational.com.au/ldap
> Submission from: (NULL) (58.172.112.108)
>
>
> FreeBSD version 7.2; Heimdal V1.2.1; OpenLDAP 2.4.16
> Heimdal and OpenLDAP are built for heimdal to use OpenLDAP as backend.
> Segmentation fault during
> kadmin -l
> init HS
Use Heimdal 1.2.2.
https://roundup.it.su.se/jira/browse/HEIMDAL-220
>
> slapd and heimdal work correctly, independently.
>
> slapd is running at debug 1019, logs are at enclosed URL along with the full gdb
> trace, and configuration files. If I can assist please advise.
>
> This is a single Pentium CPU, and gcc flags
> CFLAGS= -pipe -g3 -ggdb3 -O0 -march=pentium4 -mtune=pentium4 -DDO_KRB5
> -DDO_SAMBA -DHAVE_OPENSSL
>
> #0 0x286447b6 in memmove () from /lib/libc.so.7
> #1 0x282a10e8 in ber_write () from /usr/local/lib/liblber-2.4.so.6
> #2 0x2829ebf7 in ber_put_ostring () from /usr/local/lib/liblber-2.4.so.6
> #3 0x2829ed14 in ber_put_berval () from /usr/local/lib/liblber-2.4.so.6
> #4 0x2829faca in ber_printf () from /usr/local/lib/liblber-2.4.so.6
> #5 0x2821a0ee in ldap_add_ext () from /usr/local/lib/libldap-2.4.so.6
> #6 0x2821a378 in ldap_add_ext_s () from /usr/local/lib/libldap-2.4.so.6
> #7 0x280ad493 in LDAP_store (context=0x287050b0, db=0x2870e040, flags=0,
> entry=0xbfbfe790) at hdb-ldap.c:1600
> #8 0x2809875b in kadm5_s_create_principal (server_handle=0x2871b0c0,
> princ=0xbfbfea3c, mask=17, password=0xbfbfe830 "bQdxg9drKf")
> at create_s.c:182
> #9 0x2808da1c in kadm5_create_principal (server_handle=0x2871b0c0,
> princ=0xbfbfea3c, mask=17, password=0xbfbfe830 "bQdxg9drKf")
> at common_glue.c:64
> #10 0x0804e496 in ?? ()
> #11 0x2871b0c0 in ?? ()
> #12 0xbfbfea3c in ?? ()
> #13 0x00000011 in ?? ()
> #14 0xbfbfe830 in ?? ()
> #15 0x28084000 in ?? ()
> #16 0x28084200 in ?? ()
> #17 0x28084400 in ?? ()
> #18 0x285b3c8d in _pthread_mutex_init_calloc_cb () from /lib/libc.so.7
> #19 0x0804e8c9 in ?? ()
> #20 0x2870c0e0 in ?? ()
> #21 0x00000000 in ?? ()
> #22 0x00000000 in ?? ()
> #23 0x00000000 in ?? ()
> #24 0x2870c085 in ?? ()
> #25 0x00000000 in ?? ()
> #26 0x28650030 in ?? () from /lib/libc.so.7
>
> I have spent weeks trying to get this to work. (Because I'm using and modifying
> the FreeBSD ports system to build and use the latest version of LDAP and
> Heimdal.)
>
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
(ITS#6158) syncprov: assert causing slapd to core dump
by jonathan@phillipoux.net
Full_Name: Jonathan Clarke
Version: 2.3.43
OS: Solaris
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (213.41.243.192)
Hi,
I have a 2.3.43 running on a Solaris Sparc server, which crashes occasionally -
once every week or two, always during the night. At this particular time a large
number of operations are performed, including mass deletes and adds. I haven't
been able to reproduce this bug, just watch it happen on the production server
every now and again...
I managed to obtain a coredump, and a backtrace (at the end of this message). I
realize this isn't much to go on, but I'm rather unfamiliar with this part of
the code, so I wondered if anyone has an idea what's going on here?
FWIW, the dynlist and chain overlays are in use on the server, and the database
is bdb, with a syncrepl consumer as well as syncprov overlay.
Backtrace follows:
8<-------------------------------------------------------------
Thread 1 (process 1054014 ):
#0 0xfee4aa58 in _lwp_kill () from /lib/libc.so.1
#1 0xfede5a64 in raise () from /lib/libc.so.1
#2 0xfedc1954 in abort () from /lib/libc.so.1
#3 0xfedc1b90 in _assert () from /lib/libc.so.1
#4 0xff30ef44 in ldap_pvt_runqueue_resched (rq=0x16c630, entry=0xee6c0a0,
defer=0) at rq.c:165
#5 0xfe7f4a94 in syncprov_qstart (so=0x10acb540) at syncprov.c:933
#6 0xfe7f4d6c in syncprov_qresp (opc=0x1b1bfaf8, so=0x10acb540, mode=2) at
syncprov.c:982
#7 0xfe7f5aa4 in syncprov_matchops (op=0xf6bffa50, opc=0x1b1bfaf8, saveit=0) at
syncprov.c:1175
#8 0xfe7f7490 in syncprov_op_response (op=0xf6bffa50, rs=0xf6bff644) at
syncprov.c:1561
#9 0x000575cc in ?? ()
#10 0x000575cc in ?? ()
8<-------------------------------------------------------------
Thanks in advance for any pointers!
Regards,
Jonathan
Re: (ITS#6133) back-relay issues
by h.b.furuseth@usit.uio.no
Questions:
* relay_back_operational() sets up callbacks. Should it?
Looks harmless, but as far as I can tell, be->be_operational()
functions do not use them, since they (should) send no response.
* There is no relay_back_chk_controls(). Should there be?
Though I think DNs would then be rewritten four times the same way
for each operation:-( Already operational, has_subordinates and
finally the operation itself does. And possibly for access controls.
I've factored op.c code out to table-driven handlers and a macro,
and cleaned away those '#if 0's.
Fixed more problems:
* Search referrals should have a scope.
* relay_back_op_extended() was (still) broken.
The handler should return a result which caller should send, so it
must set sr->sr_ref without freeing it. Setting REP_REF_MUSTBEFREED
instead, and dropping the RB_SEND requirement in fail_mode.
* For readability, fixed return values from relay_back_chk_referrals()
and other unused handlers. (chk_referrals may be unfixable.)
* relay_back_entry_<get/release>_rw() returned operationsError for
failure. Failing with noSuchObject/unwillingToPerform instead.
* relay_back_entry_release_rw() leaked entries when bd->be_release==0.
For paranoia, fixed it only when the entry's e_private == NULL.
> * The handlers for Abandon, Cancel and connection-init/destroy
> should not exist, as far as I can tell. They forward the call to the
> underlying backend database, but that means it receives the call twice:
So does Unbind. Removed the handler.
> * back-relay can be configured to cause infinite recursion. (...)
> Anyway, recursion can now be properly caught with op->o_extra. (...)
Needed a unique key per <operation type, relay database> combination.
Otherwise things like backend_group() called from another operation
failed when looking up a relayed DN via relay_back_entry_get_rw().
That fixed relay_back_operational() and relay_back_has_subordinates(),
or at least I assume that's what their FIXMEs were about.
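The recursion check keyed on <operation type, relay database> could look
roughly like this. The names and structure are invented for illustration
(the real code hangs a node off op->o_extra); the point is that matching
on the pair, not the database alone, is what lets e.g. backend_group()
relay the same database for a different operation type without being
misdiagnosed as recursion.

```c
#include <stddef.h>

/* Hypothetical guard node; slapd's real OpExtra list differs in detail. */
typedef struct relay_guard {
	struct relay_guard	*rg_next;
	const void		*rg_db;		/* relay database instance */
	int			rg_optype;	/* operation type */
} relay_guard;

/* Push a <optype, db> frame onto the per-operation stack.  Returns
 * nonzero (and pushes nothing) if the pair is already on the stack,
 * i.e. the relay would recurse into itself. */
static int
relay_guard_enter( relay_guard **stack, relay_guard *node,
	const void *db, int optype )
{
	relay_guard *g;

	for ( g = *stack; g; g = g->rg_next )
		if ( g->rg_db == db && g->rg_optype == optype )
			return 1;	/* recursion detected */

	node->rg_db = db;
	node->rg_optype = optype;
	node->rg_next = *stack;
	*stack = node;
	return 0;
}

/* Pop the top frame when the relayed call returns. */
static void
relay_guard_leave( relay_guard **stack )
{
	*stack = (*stack)->rg_next;
}
```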
--
Hallvard
Re: (ITS#6131) "TLSVerifyClient try" not working with GNU TLS
by subbarao@computer.org
Howard Chu wrote:
>> I was just looking around for a possible explanation to the problem that
>> I'm encountering.
>>
>> I double-checked the version that I was running and it's actually
>> 2.4.15, not 2.4.16. Would there be a significant difference between
>> these two versions with respect to TLS certificate handling?
>
> Yes. Read the 2.4.16 CHANGES.
Ok, I see the following:
Fixed libldap GnuTLS TLSVerifyCilent try (ITS#5981)
Looking at ITS #5981, that seems to be exactly the same problem that I'm
having. I tried searching the openldap.org site before for similar
keywords, but I guess I missed this.
Thanks,
-Kartik