Re: (ITS#5470) Sporadic failures with RE24
by raphael.ouazana@linagora.com
Hi,
On Fri, 2 May 2008 11:01, hyc(a)symas.com wrote:
> luca(a)OpenLDAP.org wrote:
>> luca(a)OpenLDAP.org wrote:
>>>
>>> Howard Chu wrote:
>>>
>>>> Thanks. Please try HEAD again.
>>>>
>>> No way.
>>> new testrun directory in
>>> ftp://ftp.sys-net.it/luca_scamoni_its5470_20080430-new.tgz
>>>
>>> backtrace attached
>>>
>> recent commits seem to have fixed it (at least, right now I'm not able
>> to reproduce it anymore...)
>
> Right. Confirmed here too; I (temporarily) added an assert(0) to the
> offending
> branch of code to make sure the patch was actually getting hit. It takes a
> very particular timing to trigger that code path.
>
> I'm not sure how we can reliably test for this down the road. Perhaps we
> should add a "disabled" config keyword for backends and syncrepl
> consumers, so
> that we can start up the individual servers, (which takes an unpredictable
> amount of time for each) and then enable various parts in a fixed sequence
> (e.g. 1 second sleeps between ldapmodify/enable requests). Even that's hit
> or
> miss, because our test database is so small it's unlikely that we can hit
> the
> window of time on demand.
I'm testing the latest RE24 tag. After 201 successful runs of test050, I got
a failure :/
Cleaning up test run directory leftover from previous run.
Running ./scripts/test050-syncrepl-multimaster...
running defines.sh
Initializing server configurations...
Starting producer slapd on TCP/IP port 9011...
Using ldapsearch to check that producer slapd is running...
Inserting syncprov overlay on producer...
Starting consumer slapd on TCP/IP port 9012...
Using ldapsearch to check that consumer slapd is running...
Configuring syncrepl on consumer...
Starting consumer2 slapd on TCP/IP port 9013...
Using ldapsearch to check that consumer2 slapd is running...
Configuring syncrepl on consumer2...
Adding schema and databases on producer...
Using ldapadd to populate producer...
Waiting 20 seconds for syncrepl to receive changes...
Using ldapadd to populate consumer...
Waiting 20 seconds for syncrepl to receive changes...
Using ldapsearch to check that syncrepl received database changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
Waiting 5 seconds for syncrepl to receive changes...
ldapsearch failed (32)!
testrun uploaded in
ftp://ftp.openldap.org/incoming/raphael-ouazana-testrun-080505.tgz
Regards,
Raphaël Ouazana.
Re: (ITS#5488) syncrepl received contextCSN not passed on to syncprov consumers
by hyc@symas.com
Rein Tollevik wrote:
> On Wed, 30 Apr 2008, Howard Chu wrote:
>> rein(a)OpenLDAP.org wrote:
>>> My first attempt at fixing this was to change syncprov to fetch the
>>> queued csn values from the glue backend where it was used. But that
>>> failed as other modules queue the csn values in their own backend when
>>> they change things.
>> What other modules? Generally there cannot be any other sources of changes.
>
> Sorry, I should have written other configurations. The CSNs get queued
> in the subordinate database when syncrepl is used there, or not at all
> (i.e., for regular updates that come in through the frontend).
OK, but that's again quite a special case. I.e., that's multi-master; in the
default (single-master) case there cannot be regular updates arriving through
the frontend. When a single-master syncrepl consumer is configured, that is
the only possible source of updates. Let's be sure we've solved this question
for the single-master case first, before addressing the multi-master case.
While it's expected that the software will be able to handle multiple glued
DBs and multi-master across them, I seriously doubt that anyone out there
actually knows how to configure and maintain such a setup yet.
>>> Instead I changed ctxcsn.c so that it always
>>> queues them in the glue backend where syncprov is used. But I don't
>>> feel that my understanding of this stuff is good enough to be sure that
>>> this is the optimal solution..
>> I definitely don't like references to the syncprov overlay appearing in main
>> slapd code like that. We need a different solution.
> To me it makes sense to have a single queue of CSN values in a glued
> configuration, no matter if or where syncprov is used.
Yes, I can probably go along with that. The downside is that it may reduce
write concurrency a bit, compared to a glued configuration where each glued DB
is otherwise independent.
> Another approach could be to have syncprov look in the glue database if
> it fails to find any queued CSN in a subordinate db. I haven't tested
> it, but that should work in both configurations. It should also remove
> the need to always look for the glue db which my patch requires. Would
> that be better?
That sounds like a decent alternative.
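For illustration only, a rough sketch of that fallback approach (this is not
the committed fix; get_queued_csn() and find_glue_db() are hypothetical
placeholders standing in for the real pending-CSN list access and for
locating the superior glue BackendDB):

#include "slap.h"

/* hypothetical helpers, not existing slapd functions */
static int get_queued_csn( BackendDB *be, struct berval *csn );
static BackendDB *find_glue_db( BackendDB *be );

static int
csn_lookup_with_glue_fallback( Operation *op, struct berval *csn )
{
	/* use the real BackendDB, not a per-operation overlay copy */
	BackendDB *be = op->o_bd->bd_self;

	/* first look in the database where the change was made */
	if ( get_queued_csn( be, csn ) == 0 )
		return 0;

	/* nothing queued here: if this is a glue subordinate, fall back
	 * to the superior database where syncrepl queued the CSN */
	if ( SLAP_GLUE_SUBORDINATE( be )) {
		BackendDB *glue = find_glue_db( be );
		if ( glue != NULL )
			return get_queued_csn( glue, csn );
	}

	return -1;	/* no queued CSN found anywhere */
}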
>>> Btw, in syncprov_checkpoint() there is a similar SLAP_GLUE_SUBORDINATE
>>> test, should that have included an overlay_is_inst() clause as well?
>> Perhaps. You would have to use op->o_bd->bd_self instead of op->o_bd on
>> that call.
> The current test (introduced to fix ITS#5433) causes the contextCSN to
> be written to the glue database when syncprov is used on a subordinate
> db, which appears wrong to me.
Understood.
Again, the question is whether the admin intended to configure a single
syncprov over an entire glued DB, or individual syncprovs over each component
of the glued tree. The distinction is vital, and it's detected based on
whether the syncprov overlay is above the glue overlay in the overlay stack,
or below it, on the topmost DB.
> Could you elaborate on when op->o_bd->bd_self must be used instead of
> op->o_bd? I understand that op->o_bd may be a copy of the original
> structure that op->o_bd->bd_self refers to, but I'm not sure when it
> must be used. Btw, could op->o_bd->bd_self->bd_info be used to fetch
> the BackendInfo that can be used to call the top-most bd_search (and
> similar) also in overlays?
If you read the code for overlay_is_inst() it should be obvious - that
function only works when used with a real BackendDB structure. The local copy
structure has had its bd_info replaced with whatever on_inst structure
corresponds to the current overlay.
Yes, the bd_self points to the topmost structure, so you can use it for
be_search. Much of what's happening in these overlays was intended to avoid
starting over at the top though, because the code is already running in the
desired overlay context.
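In code terms the distinction is roughly the following (a sketch, not code
from the tree; it only assumes the constructs named above - bd_self,
SLAP_GLUE_SUBORDINATE, overlay_is_inst and be_search - and that op is a
search operation whose parameters can be reused):

#include "slap.h"

static int
example_from_inside_an_overlay( Operation *op )
{
	SlapReply frs = { REP_RESULT };
	Operation fop;
	/* op->o_bd is a per-operation copy whose bd_info points into the
	 * overlay chain; tests that need the real database go via bd_self */
	BackendDB *real_db = op->o_bd->bd_self;

	if ( SLAP_GLUE_SUBORDINATE( real_db )
		&& overlay_is_inst( real_db, "syncprov" )) {
		/* syncprov is configured directly on this subordinate DB */
	}

	/* to re-enter the full database + overlay stack from the top,
	 * point the copied operation at the real (topmost) BackendDB;
	 * the overlay infrastructure makes its own local copy */
	fop = *op;
	fop.o_bd = real_db;
	return fop.o_bd->be_search( &fop, &frs );
}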
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Re: (ITS#5487) syncprov_findbase must search the backend from the syncrepl search
by hyc@symas.com
Rein Tollevik wrote:
> On Wed, 30 Apr 2008, Howard Chu wrote:
>
>> rein(a)OpenLDAP.org wrote:
>
>>> syncprov_findbase() must search the backend saved with the syncrepl
>>> operation,
>>> not the one from the operation passed as argument. The backend in the op
>>> argument can be a subordinate database, in which case the search for the
>>> base in
>>> the superior database will fail, and syncrepl consumers will be forced to
>>> do an unnecessary full refresh of the database.
>> OK.
>>
>>> The patch at the end should fix
>>> this. Note that both fop.o_bd and fop.o_bd->bd_info can be changed by the
>>> overlay_op_walk() call, which is the reason for the long pointer traversal
>>> to
>>> find the correct bd_info to save and restore.
>
>> But the overlay_op_walk call is only appropriate when the DB to be searched
>> is the current database, and the current DB is an overlay DB structure.
>
> Ah, the changing of the BackendDB->bd_info that takes place when
> overlays are called feels like an open pit I manage to fall into every
> time I get close to it... I wish it could be replaced in a future
> version.
Agreed, it would have been safer as an Op-specific field, but that would have
caused quite a lot of disruption to all existing backend code.
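Purely as an illustration of that alternative (a hypothetical sketch; none
of these names exist in slapd today):

/* hypothetical: carry the overlay dispatch position in the Operation,
 * so the shared BackendDB never has its bd_info rewritten in place */
typedef struct HypotheticalOpOverlayState {
	BackendInfo	*oos_info;	/* which bi_op_* table to dispatch to next */
	int		oos_depth;	/* position within the overlay chain */
} HypotheticalOpOverlayState;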
> A new patch that I hope fixes this is at the end. It always uses
> be_search, after putting back the original bd_info if needed. I feel
> that using the generic be_search is better than interfering directly
> with the overlay code as overlay_op_walk does. I also tested for
> SLAP_ISOVERLAY rather than PS_IS_REFRESHING, as that appeared more
> generic to me. But again, I may be totally wrong here. Does this patch
> look better?
SLAP_ISOVERLAY will never be true here. That flag is only set when the
BackendDB being tested is a local copy of a real BackendDB structure. The
structure referenced in s_op is always a real BackendDB.
In fact, if you're always going to use s_op and be_search, there's no further
work needed, because the regular overlay infrastructure will always make a new
local BackendDB copy itself. (And of course, some of that would be wasted
effort, which is why the original code uses overlay_op_walk. Since op->o_bd is
already an overlay DB, there's no need to make yet another copy for the
first-search case.)
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Re: (ITS#5488) syncrepl received contextCSN not passed on to syncprov consumers
by rein@OpenLDAP.org
On Wed, 30 Apr 2008, Howard Chu wrote:
> rein(a)OpenLDAP.org wrote:
>> When syncrepl and syncprov are both used on a glue database, the
>> contextCSN values received from the syncrepl producers are not passed on to the
>> syncprov consumers when changes in subordinate databases are received.
>> The reason is that syncrepl queues the CSNs in the glue backend, while
>> syncprov fetches them from the backend where the changes are made. As a
>> consequence, the consumers will be passed a cookie without any csn
>> value.
>>
>> My first attempt at fixing this was to change syncprov to fetch the
>> queued csn values from the glue backend where it was used. But that
>> failed as other modules queue the csn values in their own backend when
>> they change things.
>
> What other modules? Generally there cannot be any other sources of changes.
Sorry, I should have written other configurations. The CSNs get queued
in the subordinate database when syncrepl is used there, or not at all
(i.e., for regular updates that come in through the frontend).
>> Instead I changed ctxcsn.c so that it always
>> queues them in the glue backend where syncprov is used. But I don't
>> feel that my understanding of this stuff is good enough to be sure that
>> this is the optimal solution..
>
> I definitely don't like references to the syncprov overlay appearing in main
> slapd code like that. We need a different solution.
That's reasonable, but the test for syncrepl is probably not needed if
this solution is kept. The test was more or less a copy and
paste from syncrepl where it finds out which backend to write through.
To me it makes sense to have a single queue of CSN values in a glued
configuration, no matter if or where syncprov is used.
> At one point in the past, I had changed syncrepl.c to queue the CSNs in
> both places, but that seemed rather sloppy. Still, it may work best here.
I don't like duplicating information; sooner or later it tends to end up
with wrong info in one of the places.
Another approach could be to have syncprov look in the glue database if
it fails to find any queued CSN in a subordinate db. I haven't tested
it, but that should work in both configurations. It should also remove
the need to always look for the glue db which my patch requires. Would
that be better?
>> Btw, in syncprov_checkpoint() there is a similar SLAP_GLUE_SUBORDINATE
>> test, should that have included an overlay_is_inst() clause as well?
>
> Perhaps. You would have to use op->o_bd->bd_self instead of op->o_bd on
> that call.
The current test (introduced to fix ITS#5433) causes the contextCSN to
be written to the glue database when syncprov is used on a subordinate
db, which appears wrong to me.
Could you elaborate on when op->o_bd->bd_self must be used instead of
op->o_bd? I understand that op->o_bd may be a copy of the original
structure that op->o_bd->bd_self refers to, but I'm not sure when it
must be used. Btw, could op->o_bd->bd_self->bd_info be used to fetch
the BackendInfo that can be used to call the top-most bd_search (and
similar) also in overlays?
Rein
(ITS#5494) slapd crashed when accessed by multiple threads
by adejong@debian.org
Full_Name: Arthur de Jong
Version: 2.4.7
OS: Debian unstable
URL: http://arthurenhella.demon.nl/nss-ldapd/adejong-slapd-crash.log
Submission from: (NULL) (83.160.165.27)
This has also been submitted as a Debian bug:
http://bugs.debian.org/479237
My test slapd consistently crashes when doing multiple simultaneous
requests in different threads. Each thread has its own LDAP *ld
connection to the LDAP server, which is supposed to be supported [1]. In
any case this shouldn't crash the LDAP server.
[1] http://www.openldap.org/lists/openldap-software/200606/msg00252.html
This problem arises in my test suite for nss-ldapd. Source can be
checked out at http://arthurenhella.demon.nl/svn/nss-ldapd/ (svn), and
the test file is test/test_myldap.c. It uses a wrapper module (myldap)
around calls to OpenLDAP to simplify memory management. The function
that triggers the crash is test_threads().
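For reference, the per-thread connection pattern being described is roughly
the following (an illustrative sketch, not the actual test_myldap.c code;
the URI, base and thread count are made up; build with -lldap -lpthread):

#include <pthread.h>
#include <ldap.h>

/* each thread opens and uses its own LDAP *ld handle */
static void *worker( void *arg )
{
	LDAP *ld = NULL;
	LDAPMessage *res = NULL;
	int version = LDAP_VERSION3;
	(void)arg;

	if ( ldap_initialize( &ld, "ldap://localhost" ) != LDAP_SUCCESS )
		return NULL;
	ldap_set_option( ld, LDAP_OPT_PROTOCOL_VERSION, &version );

	/* one anonymous search per thread, all running concurrently */
	if ( ldap_search_ext_s( ld, "dc=test,dc=tld", LDAP_SCOPE_SUBTREE,
			"(objectClass=*)", NULL, 0, NULL, NULL, NULL,
			LDAP_NO_LIMIT, &res ) == LDAP_SUCCESS )
		ldap_msgfree( res );

	ldap_unbind_ext_s( ld, NULL, NULL );
	return NULL;
}

int main( void )
{
	pthread_t tids[5];
	int i;

	for ( i = 0; i < 5; i++ )
		pthread_create( &tids[i], NULL, worker, NULL );
	for ( i = 0; i < 5; i++ )
		pthread_join( tids[i], NULL );
	return 0;
}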
I have captured the crash in gdb:
# gdb /usr/sbin/slapd
GNU gdb 6.8-debian
[...]
This GDB was configured as "i486-linux-gnu"...
(gdb) r -d 1 -h ldap:/// ldaps:/// ldapi:/// -g openldap -u openldap -f
/etc/ldap/slapd.conf
Starting program: /usr/sbin/slapd -d 1 -h ldap:/// ldaps:/// ldapi:/// -g
openldap -u openldap -f /etc/ldap/slapd.conf
[Thread debugging using libthread_db enabled]
[New Thread 0xb7b3a930 (LWP 1542)]
@(#) $OpenLDAP: slapd 2.4.7 (Apr 16 2008 08:13:31) $
@minerva.hungry.com:/home/pere/src/debiancvs/initscripts-ng-svn/trunk/src/insserv/openldap2.3-2.4.7/debian/build/servers/slapd
ldap_pvt_gethostbyname_a: host=sorbet, r=0
daemon_init: listen on ldap:///
daemon_init: 1 listeners to open...
[...]
<= send_search_entry: conn 2 exit.
entry_decode: "cn=Zaka Eddins+uid=zeddins,ou=lotsofpeople,dc=test,dc=tld"
<= entry_decode(cn=Zaka Eddins+uid=zeddins,ou=lotsofpeople,dc=test,dc=tld)
=> send_search_entry: conn 2 dn="cn=Zaka
Eddins+uid=zeddins,ou=lotsofpeople,dc=test,dc=tld"
ber_flush2: 107 bytes to sd 18
<= send_search_entry: conn 2 exit.
entry_decode: "uid=wvakil,ou=lotsofpeople,dc=test,dc=tld"
<= entry_decode(uid=wvakil,ou=lotsofpeople,dc=test,dc=tld)
=> send_search_entry: conn 2 dn="uid=wvakil,ou=lotsofpeople,dc=test,dc=tld"
ber_flush2: 90 bytes to sd 18
<= send_search_entry: conn 2 exit.
entry_decode: "uid=zmeeker,ou=lotsofpeople,dc=test,dc=tld"
<= entry_decode(uid=zmeeker,ou=lotsofpeople,dc=test,dc=tld)
=> send_search_entry: conn 2 dn="uid=zmeeker,ou=lotsofpeople,dc=test,dc=tld"
ber_flush2: 92 bytes to sd 18
<= send_search_entry: conn 2 exit.
bdb_search: 1104 scope not okay
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb5f18b90 (LWP 5017)]
0xb7cef160 in pthread_mutex_lock () from /lib/libpthread.so.0
(gdb) bt
#0 0xb7cef160 in pthread_mutex_lock () from /lib/libpthread.so.0
#1 0xb7f4351d in ldap_pvt_thread_mutex_lock () from
/usr/lib/libldap_r-2.4.so.2
#2 0xb783883d in bdb_cache_return_entry_rw (bdb=0x81ea358, e=0x820922c, rw=0,
lock=0xb5f16fd4)
at /home/pere/src/debiancvs/initscripts-ng-svn/trunk/src/insserv/openldap2.3-2.4.7/servers/slapd/back-bdb/cache.c:256
#3 0xb782ce12 in bdb_search (op=0x8299b10, rs=0xb5f18168)
at /home/pere/src/debiancvs/initscripts-ng-svn/trunk/src/insserv/openldap2.3-2.4.7/servers/slapd/back-bdb/search.c:909
#4 0x08077d13 in fe_op_search (op=0x8299b10, rs=0xb5f18168)
at /home/pere/src/debiancvs/initscripts-ng-svn/trunk/src/insserv/openldap2.3-2.4.7/servers/slapd/search.c:368
#5 0x0807853c in do_search (op=0x8299b10, rs=0xb5f18168)
at /home/pere/src/debiancvs/initscripts-ng-svn/trunk/src/insserv/openldap2.3-2.4.7/servers/slapd/search.c:217
#6 0x080757c6 in connection_operation (ctx=0xb5f18248, arg_v=0x8299b10)
at /home/pere/src/debiancvs/initscripts-ng-svn/trunk/src/insserv/openldap2.3-2.4.7/servers/slapd/connection.c:1083
#7 0x08075ed6 in connection_read_thread (ctx=0xb5f18248, argv=0x13)
at /home/pere/src/debiancvs/initscripts-ng-svn/trunk/src/insserv/openldap2.3-2.4.7/servers/slapd/connection.c:1210
#8 0xb7f42a44 in ?? () from /usr/lib/libldap_r-2.4.so.2
#9 0xb5f18248 in ?? ()
#10 0x00000013 in ?? ()
#11 0x00000000 in ?? ()
A more detailed backtrace is available at the URL specified above.
Re: (ITS#5487) syncprov_findbase must search the backend from the syncrepl search
by rein@OpenLDAP.org
On Wed, 30 Apr 2008, Howard Chu wrote:
> rein(a)OpenLDAP.org wrote:
>> syncprov_findbase() must search the backend saved with the syncrepl
>> operation,
>> not the one from the operation passed as argument. The backend in the op
>> argument can be a subordinate database, in which case the search for the
>> base in
>> the superior database will fail, and syncrepl consumers will be forced to
>> do an unnecessary full refresh of the database.
>
> OK.
>
>> The patch at the end should fix
>> this. Note that both fop.o_bd and fop.o_bd->bd_info can be changed by the
>> overlay_op_walk() call, which is the reason for the long pointer traversal
>> to
>> find the correct bd_info to save and restore.
> But the overlay_op_walk call is only appropriate when the DB to be searched
> is the current database, and the current DB is an overlay DB structure.
Ah, the changing of the BackendDB->bd_info that takes place when
overlays are called feels like an open pit I manage to fall into every
time I get close to it... I wish it could be replaced in a future
version.
> Your patch causes fc->fss->s_op->o_bd's bd_info pointer to change, which is
> not allowed. That's in the original backendDB, which must be treated as
> read-only since multiple threads may be accessing it. The correct approach
> here is to use a new local backendDB variable, copy the s_op->o_bd into it,
> and then just do a regular be_search invocation instead of using
> overlay_op_walk.
>
> But, this patch must not take effect on the first call to syncprov_findbase
> (which occurred in syncprov_op_search) - in that case, the current code is
> correct. So, you need to tweak things based on whether (s_flags &
> PS_IS_REFRESHING) is true or not - if true, this is the first search, and it
> should use the original code. Else, it must use be_search.
A new patch that I hope fixes this is at the end. It always uses
be_search, after putting back the original bd_info if needed. I feel
that using the generic be_search is better than interfering directly
with the overlay code as overlay_op_walk does. I also tested for
SLAP_ISOVERLAY rather than PS_IS_REFRESHING, as that appeared more
generic to me. But again, I may be totally wrong here. Does this patch
look better?
Rein
Index: OpenLDAP/servers/slapd/overlays/syncprov.c
===================================================================
RCS file: /f/CVSROOT/drift/OpenLDAP/servers/slapd/overlays/syncprov.c,v
retrieving revision 1.1.1.18
diff -u -u -r1.1.1.18 syncprov.c
--- OpenLDAP/servers/slapd/overlays/syncprov.c 30 Apr 2008 11:17:58 -0000 1.1.1.18
+++ OpenLDAP/servers/slapd/overlays/syncprov.c 2 May 2008 11:19:46 -0000
@@ -404,7 +404,7 @@
slap_callback cb = {0};
Operation fop;
SlapReply frs = { REP_RESULT };
- BackendInfo *bi;
+ BackendDB be;
int rc;
fc->fss->s_flags ^= PS_FIND_BASE;
@@ -413,10 +413,15 @@
fop = *fc->fss->s_op;
fop.o_hdr = op->o_hdr;
- fop.o_bd = op->o_bd;
fop.o_time = op->o_time;
fop.o_tincr = op->o_tincr;
- bi = op->o_bd->bd_info;
+
+ if ( SLAP_ISOVERLAY( fop.o_bd )) {
+ slap_overinst *on = (slap_overinst *)fop.o_bd->bd_info;
+ be = *fop.o_bd;
+ be.bd_info = (BackendInfo *)on->on_info;
+ fop.o_bd = &be;
+ }
cb.sc_response = findbase_cb;
cb.sc_private = fc;
@@ -434,8 +439,7 @@
fop.ors_filter = &generic_filter;
fop.ors_filterstr = generic_filterstr;
- rc = overlay_op_walk( &fop, &frs, op_search, on->on_info, on );
- op->o_bd->bd_info = bi;
+ rc = fop.o_bd->be_search( &fop, &frs );
} else {
ldap_pvt_thread_mutex_unlock( &fc->fss->s_mutex );
fc->fbase = 1;