(ITS#8493) Under heavy modrdn load, masters desync
by quanah@openldap.org
Full_Name: Quanah Gibson-Mount
Version: 2.4.44+ITS8432
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (75.111.52.177)
Trying to reproduce another ITS, I discovered a new bug. When doing MODRDN ops
on one master, the other master keeps going out of sync. Specifically:
Sep 3 01:12:17 zre-ldap002 slapd[29206]: syncrepl_message_to_op: rid=100
be_modrdn uid=user.924,ou=people,dc=zre-ldap002,dc=eng,dc=zimbra,dc=com (32)
Sep 3 01:12:17 zre-ldap002 slapd[29206]: do_syncrep2: rid=100 delta-sync lost
sync on (reqStart=20160903051215.747829Z,cn=accesslog), switching to REFRESH
6 years, 8 months
RE: (ITS#8492) signal 11 in ldap_new_connection
by quanah@zimbra.com
--On Friday, September 02, 2016 7:35 AM +0000 htjain(a)sandvine.com wrote:
> After some more debugging looks like ld->ldc->ldc_options.ldo_defludp
> varia= ble is causing this issue, its null when its passed to
> ldap_new_connection. How does it populated?
Please confirm your encounter the issue with a current release. 2.4.40 and
2.4.41 both had significant issues.
--Quanah
--
Quanah Gibson-Mount
6 years, 8 months
RE: (ITS#8492) signal 11 in ldap_new_connection
by htjain@sandvine.com
After some more debugging looks like ld->ldc->ldc_options.ldo_defludp varia=
ble is causing this issue, its null when its passed to ldap_new_connection.
How does it populated?
(gdb) p ld->ldc->ldc_options
$13 =3D {ldo_valid =3D 255, ldo_debug =3D 0, ldo_version =3D 3, ldo_deref =
=3D 0, ldo_timelimit =3D 0, ldo_sizelimit =3D 0, ldo_tm_api =3D {tv_sec =3D=
-1, tv_usec =3D 0}, ldo_tm_net =3D {tv_sec =3D 5, tv_usec =3D 0}, ldo_defl=
udp =3D 0x0, ldo_defport =3D 389,
ldo_defbase =3D 0x0, ldo_defbinddn =3D 0x0, ldo_keepalive_idle =3D 0, ldo=
_keepalive_probes =3D 0, ldo_keepalive_interval =3D 0, ldo_refhoplimit =3D =
5, ldo_sctrls =3D 0x0, ldo_cctrls =3D 0x0, ldo_rebind_proc =3D 0x0, ldo_reb=
ind_params =3D 0x0,
ldo_nextref_proc =3D 0x0, ldo_nextref_params =3D 0x0, ldo_urllist_proc =
=3D 0x0, ldo_urllist_params =3D 0x0, ldo_conn_cbs =3D 0x0, ldo_booleans =3D=
34, ldo_peer =3D 0x0, ldo_cldapdn =3D 0x0, ldo_is_udp =3D 0, ldo_tls_ctx =
=3D 0x0,
ldo_tls_connect_cb =3D 0x0, ldo_tls_connect_arg =3D 0x0, ldo_tls_info =3D=
{lt_certfile =3D 0x0, lt_keyfile =3D 0x0, lt_dhfile =3D 0x0, lt_cacertfile=
=3D 0x0, lt_cacertdir =3D 0x0, lt_ciphersuite =3D 0x0, lt_crlfile =3D 0x0,=
lt_randfile =3D 0x0,
lt_protocol_min =3D 0}, ldo_tls_mode =3D 0, ldo_tls_require_cert =3D 2,=
ldo_tls_impl =3D 0, ldo_tls_crlcheck =3D 0, ldo_def_sasl_mech =3D 0x0, ldo=
_def_sasl_realm =3D 0x0, ldo_def_sasl_authcid =3D 0x0, ldo_def_sasl_authzid=
=3D 0x0,
ldo_sasl_secprops =3D {min_ssf =3D 0, max_ssf =3D 2147483647, maxbufsize =
=3D 16777215, security_flags =3D 17, property_names =3D 0x0, property_value=
s =3D 0x0}}
(gdb) p ld->ldc->ldc_options.ldo_defludp
$14 =3D (LDAPURLDesc *) 0x0
-----Original Message-----
From: openldap-its(a)OpenLDAP.org [mailto:openldap-its@OpenLDAP.org]=20
Sent: Friday, September 2, 2016 11:50 AM
To: Hemant Jain
Subject: Re: (ITS#8492) signal 11 in ldap_new_connection
*** THIS IS AN AUTOMATICALLY GENERATED REPLY ***
Thanks for your report to the OpenLDAP Issue Tracking System. Your report =
has been assigned the tracking number ITS#8492.
One of our support engineers will look at your report in due course.
Note that this may take some time because our support engineers are volunte=
ers. They only work on OpenLDAP when they have spare time.
If you need to provide additional information in regards to your issue repo=
rt, you may do so by replying to this message. Note that any mail sent to =
openldap-its(a)openldap.org with (ITS#8492) in the subject will automatically=
be attached to the issue report.
mailto:openldap-its@openldap.org?subject=3D(ITS#8492)
You may follow the progress of this report by loading the following URL in =
a web browser:
http://www.OpenLDAP.org/its/index.cgi?findid=3D8492
Please remember to retain your issue tracking number (ITS#8492) on any furt=
her messages you send to us regarding this report. If you don't then you'l=
l just waste our time and yours because we won't be able to properly track =
the report.
Please note that the Issue Tracking System is not intended to be used to se=
ek help in the proper use of OpenLDAP Software.
Such requests will be closed.
OpenLDAP Software is user supported.
http://www.OpenLDAP.org/support/
--------------
Copyright 1998-2007 The OpenLDAP Foundation, All Rights Reserved.
6 years, 8 months
(ITS#8492) signal 11 in ldap_new_connection
by htjain@sandvine.com
Full_Name: hemant jain
Version: openldap-2.4.40-8
OS: RHEL 7.2
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (123.63.202.166)
Hi,
We are facing a strange core dump in ldap_new_connection function while trying
to bind a stopped Openldap server.
following is the stack trace:
#0 0x00007f9336898f53 in ldap_new_connection (ld=ld@entry=0x7f931f43cd60,
srvlist=srvlist@entry=0x7f931f48b298, use_ldsb=use_ldsb@entry=1,
connect=connect@entry=1, bind=bind@entry=0x0, m_req=m_req@entry=0,
m_res=m_res@entry=0)
at request.c:481
#1 0x00007f933688530f in ldap_open_defconn (ld=ld@entry=0x7f931f43cd60) at
open.c:41
#2 0x00007f933689a248 in ldap_send_initial_request (ld=ld@entry=0x7f931f43cd60,
msgtype=msgtype@entry=96, dn=dn@entry=0x7f931f827718
"cn=Manager,dc=test,dc=com", ber=ber@entry=0x7f931f457150, msgid=msgid@entry=1)
at request.c:130
#3 0x00007f933688f178 in ldap_sasl_bind (ld=0x7f931f43cd60, dn=0x7f931f827718
"cn=Manager,dc=test,dc=com", mechanism=<optimized out>, cred=0x7f931f4296b0,
sctrls=0x0, cctrls=<optimized out>, msgidp=0x7f92243ed8dc) at sasl.c:148
#4 0x00007f9340f571da in ldap::Connection::Bind
(this=this@entry=0x7f931c043478, msg_id=@0x7f92243ed98c: -1) at
support/ldapconnection.cpp
I have installed corresponding debug rpm for Openldap library.
Following the source info. core dump is coming at line 481 of request.c
file.A%A
for ( srvp = srvlist; *srvp != NULL; srvp = &(*srvp)->lud_next ) // Here srvp is
NULL so *srvp dumping core.
if ( connect ) {
LDAPURLDesc **srvp, *srv = NULL;
async = LDAP_BOOL_GET( &ld->ld_options, LDAP_BOOL_CONNECT_ASYNC );
for ( srvp = srvlist; *srvp != NULL; srvp = &(*srvp)->lud_next ) {
int rc;
rc = ldap_int_open_connection( ld, lc, *srvp, async );
if ( rc != -1 ) {
srv = *srvp;
if ( ld->ld_urllist_proc && ( !async || rc != -2 ) ) {
ld->ld_urllist_proc( ld, srvlist, srvp, ld->ld_urllist_params
);
}
break;
}
}
Any idea how to fix it ?
6 years, 8 months
Re: (ITS#8490) changes not written to accesslog, causing replicas to loop syncing
by quanah@zimbra.com
--On Thursday, September 01, 2016 8:58 AM +0000 quanah(a)zimbra.com wrote:
> In looking at the LDAP accesslog, what I see is that what should have
> been a modRDN op was stored in the accesslog as a MOD op (the one I
> noted before). This seems particularly bizarre, because ldap01 should
> have rejected this change as well. It appears we may have a problem
> where the accesslog DB is updated, but then the change got rejected by
> the unique overlay.
Ok, actually the server (ldap01) accepted the change (renamed the entry).
Looking at the database, it was correctly done as a MODRDN to ldap01, but
/written/ as a MOD to accesslog?!
--Quanah
--
Quanah Gibson-Mount
6 years, 8 months
Re: (ITS#8490) changes not written to accesslog, causing replicas to loop syncing
by quanah@zimbra.com
--On Thursday, September 01, 2016 8:05 AM +0000 quanah(a)zimbra.com wrote:
> --On Thursday, September 01, 2016 7:52 AM +0000 quanah(a)openldap.org wrote:
>
>> Full_Name: Quanah Gibson-Mount
>> Version: OpenLDAP 2.4.44
>> OS: Linux 2.6
>> URL: ftp://ftp.openldap.org/incoming/
>> Submission from: (NULL) (75.111.52.177)
>>
>>
>> In a 2-node MMR setup. Node 1 is getting a lot of write traffic. Both
>> node 1 and node 2 have 3 replicas each. At some point, a change is
>> received by node 1, which writes the change to its accesslog DB and its
>> primary DB. It's 3 replicas are all correctly updated. MMR node 2
>> receives the change, updates its primary DB, but *fails* to write the
>> change to the accesslog DB. However, it *does* write the CSN update to
>> the accesslog DB successfully. This causes all of its replicas to also
>> update their CSN. Then a change comes in triggering a constraint
>> violation on the replicas, but fully accepted by their master.
>
> So the above summary is incorrect. While 3 replicas did go out of
> sync... 2 belonged to the primary master (node1), and 1 belonged to the
> secondary master (node 2). So really, 4 systems didn't log the change
> (MMR node 2, ldap05, ldap07, ldap09).
Ok, so that's not correct either. I now have the correct topography:
ldap01 has the following replicas: ldap02, ldap05, ldap07, ldap09
ldap02 has the following replicas: ldap01, ldap06, ldap08, ldap10
So the replicas of ldap01 received the change and rejected it. ldap02 just
skipped writing the entry to the accesslog, and as a result, none of its
replicas ever got the change, and thus they never hit the failure issue of
err 19, but they all are now lacking this modification entirely.
I would note that every server was loaded today from the same ldap backup,
so they were all perfectly in sync.
In looking at the LDAP accesslog, what I see is that what should have been
a modRDN op was stored in the accesslog as a MOD op (the one I noted
before). This seems particularly bizarre, because ldap01 should have
rejected this change as well. It appears we may have a problem where the
accesslog DB is updated, but then the change got rejected by the unique
overlay.
--Quanah
--
Quanah Gibson-Mount
6 years, 8 months
Re: (ITS#8490) changes not written to accesslog, causing replicas to loop syncing
by quanah@zimbra.com
--On Thursday, September 01, 2016 7:52 AM +0000 quanah(a)openldap.org wrote:
> Full_Name: Quanah Gibson-Mount
> Version: OpenLDAP 2.4.44
> OS: Linux 2.6
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (75.111.52.177)
>
>
> In a 2-node MMR setup. Node 1 is getting a lot of write traffic. Both
> node 1 and node 2 have 3 replicas each. At some point, a change is
> received by node 1, which writes the change to its accesslog DB and its
> primary DB. It's 3 replicas are all correctly updated. MMR node 2
> receives the change, updates its primary DB, but *fails* to write the
> change to the accesslog DB. However, it *does* write the CSN update to
> the accesslog DB successfully. This causes all of its replicas to also
> update their CSN. Then a change comes in triggering a constraint
> violation on the replicas, but fully accepted by their master.
So the above summary is incorrect. While 3 replicas did go out of sync...
2 belonged to the primary master (node1), and 1 belonged to the secondary
master (node 2). So really, 4 systems didn't log the change (MMR node 2,
ldap05, ldap07, ldap09).
So I have no explanation at the moment why ldap06, ldap08, and ldap10 were
not affected.
--Quanah
--
Quanah Gibson-Mount
6 years, 8 months
(ITS#8490) changes not written to accesslog, causing replicas to loop syncing
by quanah@openldap.org
Full_Name: Quanah Gibson-Mount
Version: OpenLDAP 2.4.44
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (75.111.52.177)
In a 2-node MMR setup. Node 1 is getting a lot of write traffic. Both node 1
and node 2 have 3 replicas each. At some point, a change is received by node 1,
which writes the change to its accesslog DB and its primary DB. It's 3 replicas
are all correctly updated. MMR node 2 receives the change, updates its primary
DB, but *fails* to write the change to the accesslog DB. However, it *does*
write the CSN update to the accesslog DB successfully. This causes all of its
replicas to also update their CSN. Then a change comes in triggering a
constraint violation on the replicas, but fully accepted by their master.
Node 1 records the following change in its accesslog DB:
dn: reqStart=20160901025119.904382Z,cn=accesslog
objectClass: auditModify
structuralObctctClass: auditModify
reqStart: 20160901025119.904382Z
reqEnd: 20160901025119.906625Z
reqType: modify
reqSession: 7780
reqAuthzID: uid=zimbra,cn=admins,cn=zimbra
reqDN: uid=renamed-trackp,ou=people,dc=xxxxxxxx,dc=co,dc=za
reqResult: 0
reqM%3: zimbraMailWhitelistMaxNumEntries:= 0
reqMod: zimbraPrefMailLocalDeliveryDisabled:= FALSE
reqMod: zimbraHideInGal:= TRUE
reqMod: userPassword:= {SSHA}
reqMod: mail:= renamed-trackp(a)xxxxxxx.co.za
reqMod: displayName:= trackp(a)xxxxxxxx.co.za
reodod: objectClass:= inetOrgPerson
reqMod: objectClass:= zimbraAccount
reqMod: objectClass:= amavisAccount
reqMod: zimbraMailHost:= store328-dc01.cm.xxxxxxxxxxx
reqMod: zimbraPasswordMustChange:= FALSE
reqMod: cn:= trackp(a)xxxxxxxxxxxxx.co.za
reqMod: zimbraPasswordModifiedTime:= 20160121142933Z
reqMod: zimbraMailStatus:= enabled
reqMod: zimbraIsDelegatedAdminAccount:= FALSE
reqMod: zimbraPrefOutOfOfficeReplyEnabled:= FALSE
reqMod: zimbraMailBlacklistMaxNumEntries:= 0
reqMod: zimbraAccountStatus:= locked
reqMod: zimbraPrefTimeZoneId:= Africa/Harare
reqMod: zimbraId:= 403c7b67-eea0-4315-9b10-a2853c6cf70b
reqMod: zimbraMailTransport:= lmtp:store328-dc01.cm.xxxxxxxxxx:7025
reqMod: sn:= trackp(a)xxxxxxxxxxx.co.za
reqMod: zimbraCOSId:= 1daeb979-365f-46b7-bb3f-2ad46f4576c1
reqMod: zimbraCreateTimestamp:= 20160121142933Z
reqMod: zimbraMailDeliveryAddress:= renamed-trackp(a)xxxxxxx.co.za
reqMod: entryCSN:= 20160901025119.904589Z#000000#002#000000
reqMod: modifiersName:= uid=zimbra,cn=admins,cn=zimbra
reqMod: modifyTimestamp:= 20160901025119Z
reqEntryUUID: 2293af38-5497-1035-8bd6-1d34a55b9433
entryUUID: b46d8728-043a-1036-8ab7-9bf2e127fe3b
creatorsName: cn=config
createTimestamp: 20160901025119Z
entryCSN: 20160901025119.904589Z#000000#002#000000
modifiersName: cn=config
modifyTimestamp: 20160901025119Z
On the other master, we find in the accesslog two sequential CSN updates:
dn: reqStart=20160901025119.904218Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20160901025119.904218Z
reqEnd: 20160901025119.905035Z
reqType: modify
reqSession: 201
reqAuthzID: cn=config
reqDN:
reqResult: 0
reqMod: contextCSN:= 20160901025119.902062Z#000000#002#000000
reqMod: contextCSN:= 20160901015920.689225Z#000000#003#000000
reqEntryUUID: 78143fa2-7af5-48cc-8d5e-6e15f68dd3b2
entryUUID: b46d48a8-043a-1036-81a5-b3dead5e3175
creatorsName: cn=config
createTimestamp: 20160901025119Z
entryCSN: 20160901025119.902062Z#000000#002#000000
modifiersName: cn=config
modifyTimestamp: 20160901025119Z
dn: reqStart=20160901025119.908257Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20160901025119.908257Z
reqEnd: 20160901025119.908439Z
reqType: modify
reqSession: 201
reqAuthzID: cn=config
reqDN:
reqResult: 0
reqMod: contextCSN:= 20160901025119.904589Z#000000#002#000000
reqMod: contextCSN:= 20160901015920.689225Z#000000#003#000000
reqEntryUUID: 78143fa2-7af5-48cc-8d5e-6e15f68dd3b2
entryUUID: b46dcdaa-043a-1036-81a6-b3dead5e3175
creatorsName: cn=config
createTimestamp: 20160901025119Z
entryCSN: 20160901025119.904589Z#000000#002#000000
modifiersName: cn=config
modifyTimestamp: 20160901025119Z
Note the CSN update in the second case is for 20160901025119.904589Z. Note that
this was the CSN of the change that in node 1 was a real update. The atribute
with a constraint in this case is ZimbraID, which is colliding with the zimbraID
of the original entry it was renamed from.
6 years, 8 months