(ITS#8673) Need duration logging for sync ops
by quanah@openldap.org
Full_Name: Quanah Gibson-Mount
Version: 2.4.45
OS: Linux
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (47.208.148.239)
We log the duration of an operation at stats level which allows being able to
see how long a write OP took on a master. However we do not have any duration
logging at the sync log level and therefore cannot determine how long a write op
takes on a replica for the same modification. This is problematic when
attempting to debug issues related to replication.
6 years, 5 months
Re: (ITS#8444) Out-of-sync issue with memberOf overlay, Delta-syncrepl MMR and >2 nodes
by okuznik@symas.com
A more self-contained log of the same issue, available at
ftp://ftp.openldap.org/incoming/its8444.log
(line numbers below are against current master, commit
91f4d3a6b75e73bf4ea498e83e2e4cb4e7a320e0)
There are some things that occur in all the failures I have seen so far:
- the server that received the operation (#1) sends the accesslog entry
with no CSN in the cookie, then another provider (#2) picks up this
message and relays it to its consumers, this one with a CSN in the
cookie
- a consumer picks up these two in short succession, in the log above,
processing of the one from #2 is finished first (they are being
processed concurrently)
Usually, once one of them gets processed, the new CSN should be noted
and the other threads should just skip it (syncrepl.c:943 and onwards).
In this one, having no CSN in the cookie seems to allow both to process
so far as to run syncrepl_message_to_op(), and one of them will then
inevitably fail to add the entry.
I don't understand yet why server #1 sends the operations without a CSN
in the cookie and (especially if I reorder the overlays to set up
memberof last), the race goes the other way around and the operation to
fail is the one from server #2.
My take on it was that in a delta-sync environment all entries would be
passed with a new CSN and that should end up in the cookie, allowing
syncrepl.c:986 to do its job.
--=20
Ond=C5=99ej Kuzn=C3=ADk
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
6 years, 5 months
Re: (ITS#8672) syncrepl with openldap 2.4.{40,42} and mdb backend
by quanah@symas.com
--On Thursday, June 08, 2017 2:06 PM +0000 remy.dernat(a)umontpellier.fr
wrote:
> Best regards,
> R?my
Hi,
Thanks for the report. Please confirm you can reproduce the issue with
OpenLDAP 2.4.45. If you can, then please provide your /initial/ cn=config
database prior to doing the modifications as well.
Thanks!
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
6 years, 5 months
(ITS#8672) syncrepl with openldap 2.4.{40,42} and mdb backend
by remy.dernat@umontpellier.fr
Full_Name: Dernat R.my
Version: 2.4.40+dfsg-1+deb8u3 and 2.4.42+dfsg-2ubuntu3.2
OS: Debian and Ubuntu
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (162.38.181.76)
Hi,
Since I moved my OpenLDAP to another server my replication between 2 ldap
servers through syncrepl does not work anymore. I tested many many things.
Finally, I decided to backup the database and restore it on another server (so,
I have 3 ldap servers) and (...) it worked.
After many other tests, I was able to determine the source of this issue. With a
HDB backend on my provider my replication works, while it did not work with a
MDB backend on the provider.
I had this kind of logs on the provider (with MDB):
=============================================================================
Jun 8 12:22:03 ldap2 slapd[15083]: send_search_entry: conn 20855 ber write
failed.
Jun 8 12:24:03 ldap2 slapd[15083]: send_search_entry: conn 20888 ber write
failed.
...
=============================================================================
While, on the slave, I get:
=============================================================================
Jun 8 09:33:32 ldap3-bis slapd[88560]: do_syncrepl: rid=010 rc -1 retrying
Jun 8 09:38:32 ldap3-bis slapd[88560]: do_syncrep2: rid=010 got search entry
without Sync State control (dc=my,dc=domain,dc=com)
Jun 8 09:38:32 ldap3-bis slapd[88560]: do_syncrepl: rid=010 rc -1 retrying
Jun 8 09:43:32 ldap3-bis slapd[88560]: do_syncrep2: rid=010 got search entry
without Sync State control (dc=my,dc=domain,dc=com)
...
=============================================================================
I am able to reproduce the bug quite easily.
I added only two schemas : autofs and quota.
With
=============================================================================
ldapadd -Q -Y EXTERNAL -H ldapi:/// -f autofs.ldif
ldapadd -Q -Y EXTERNAL -H ldapi:/// -f quota.ldif
=============================================================================
I also loaded accesslog module (I am creating a specific directory for
accesslog(*)) on the provider and the syncprov module on both sides.
=============================================================================
(*)
mdkir /var/lib/ldap/accesslog
chown openldap:openldap /var/lib/ldap/accesslog
=============================================================================
Here is what I am doing to setup the syncrepl :
ldapadd -Q -Y EXTERNAL -H ldapi:/// -f file.ldif
With file.ldif, on the provider site (replication.ldif ; replacing HDB with MDB
to test with a MDB backend):
=============================================================================
#Load the syncprov and accesslog modules.
dn: cn=module{0},cn=config
changetype: modify
add: olcModuleLoad
olcModuleLoad: syncprov
-
add: olcModuleLoad
olcModuleLoad: accesslog
# Accesslog database definitions
dn: olcDatabase={2}hdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcHdbConfig
olcDatabase: {2}hdb
olcDbDirectory: /var/lib/ldap/accesslog
olcSuffix: cn=accesslog
olcRootDN: cn=XXXX,dc=YYYY,dc=ZZ
olcDbIndex: default eq
olcDbIndex: entryCSN,objectClass,reqEnd,reqResult,reqStart
# Accesslog db syncprov.
dn: olcOverlay=syncprov,olcDatabase={2}hdb,cn=config
changetype: add
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: syncprov
olcSpNoPresent: TRUE
olcSpReloadHint: TRUE
# syncrepl Provider for primary db
dn: olcOverlay=syncprov,olcDatabase={1}hdb,cn=config
changetype: add
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: syncprov
olcSpNoPresent: TRUE
# accesslog overlay definitions for primary db
dn: olcOverlay=accesslog,olcDatabase={1}hdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcAccessLogConfig
olcOverlay: accesslog
olcAccessLogDB: cn=accesslog
olcAccessLogOps: writes
olcAccessLogSuccess: TRUE
# scan the accesslog DB every day, and purge entries older than 7 days
olcAccessLogPurge: 07+00:00 01+00:00
=============================================================================
On the consumer (with a unique rid, and by replacing HDB with MDB to test with a
MDB backend), the file.ldif looks like:
=============================================================================
#Load the syncprov module.
dn: cn=module{0},cn=config
changetype: modify
add: olcModuleLoad
olcModuleLoad: syncprov
# syncrepl specific indices
dn: olcDatabase={1}hdb,cn=config
changetype: modify
add: olcDbIndex
olcDbIndex: entryUUID eq
-
add: olcSyncRepl
olcSyncRepl: rid=1
provider=ldaps://consumer.mydomain.fr
bindmethod=simple
binddn="cn=XXXXX,dc=mydomain,dc=fr"
credentials=XXXX
searchbase="dc=mydomain,dc=fr" logbase="cn=accesslog"
logfilter="(&(objectClass=auditWriteObject)(reqResult=0))
schemachecking=off
type=refreshAndPersist retry="60 +"
syncdata=accesslog
-
add: olcUpdateRef
olcUpdateRef: ldaps://consumer.mydomain.fr
=============================================================================
On the provider I am using:
=============================================================================
lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 8.8 (jessie)
Release: 8.8
Codename: jessie
dpkg -l slapd
ii slapd 2.4.40+dfsg-1+de amd64 OpenLDAP server
(slapd)
=============================================================================
Same configuration on one slave, and on the other slave, I am using:
=============================================================================
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial
dpkg -l slapd
ii slapd 2.4.42+dfsg-2ubu amd64 OpenLDAP server
(slapd)
=============================================================================
Best regards,
Rémy
6 years, 5 months
Re: ITS#8504 fix breaks Solaris builds
by quanah@symas.com
--On Wednesday, June 07, 2017 4:11 PM +0000 quanah(a)symas.com wrote:
> --On Wednesday, June 07, 2017 3:35 PM +0200 Hallvard Breien Furuseth
> <h.b.furuseth(a)usit.uio.no> wrote:
>
>> On 06. juni 2017 21:59, quanah(a)openldap.org wrote:
>>> /usr/include/signal.h:233:12: note: declared here
>>> extern int sigwait(sigset_t *);
>>> ^
>>
>> POSIX says it has two arguments:
>> http://pubs.opengroup.org/onlinepubs/9699919799/functions/sigwait.html
>> Is there a two-argument version of sigwait in the Solaris headers
>> somewhere,
>> maybe enabled by #define POSIX_C_SOURCE 1 or something like that?
>
> It's insanely long....
This patch fixed it on my box:
quanah@sol11-3:~/git/sold-2445/openldap$ git diff
diff --git a/libraries/liblmdb/mdb.c b/libraries/liblmdb/mdb.c
index 8a62eff..1976340 100644
--- a/libraries/liblmdb/mdb.c
+++ b/libraries/liblmdb/mdb.c
@@ -113,6 +113,9 @@ typedef SSIZE_T ssize_t;
/* Most platforms have posix_memalign, older may only have memalign */
#define HAVE_MEMALIGN 1
#include <malloc.h>
+#if defined (__sun)
+#define _POSIX_PTHREAD_SEMANTICS 1
+#endif
#endif
#if !(defined(BYTE_ORDER) || defined(__BYTE_ORDER))
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
6 years, 5 months
Re: (ITS#8669) Slapd service becomes unresponsive intermittently
by quanah@symas.com
--On Wednesday, June 07, 2017 8:44 PM +0200 Michael Str=C3=B6der=20
<michael(a)stroeder.com> wrote:
> quanah(a)symas.com wrote:
>>> Is it possible we have the idletimeout set too high and it should be
>>> lowered? I=3DC2=3DB9m wondering if there is some sweet-spot value for =
this
>>> particular setting.
>>
>> I generally leave it unset unless one is encountering an issue of =
running
>> out of connections. Generally, it would be fairly strange for
>> idletimeout to affect things this way at all.
>
> I generally recommend to set idletimeout even somewhat tight in case you
> don't have a strictly defined set of clients. Because a client
> application which does not use its LDAP connection for ~5 min. is most
> times simply not closing connections. And running out of file handles can
> affect all file creation on your system (e.g. creating BDB's transaction
> log files).
Yep, there can be poorly written clients out there. I'd expect idletimeout =
to be completely unrelated, given it's long standing existence and use. ;)
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
6 years, 5 months
Re: (ITS#8669) Slapd service becomes unresponsive intermittently
by michael@stroeder.com
quanah(a)symas.com wrote:
>> Is it possible we have the idletimeout set too high and it should be
>> lowered? I=C2=B9m wondering if there is some sweet-spot value for this
>> particular setting.
>
> I generally leave it unset unless one is encountering an issue of running
> out of connections. Generally, it would be fairly strange for idletimeout
> to affect things this way at all.
I generally recommend to set idletimeout even somewhat tight in case you don't have a
strictly defined set of clients. Because a client application which does not use its LDAP
connection for ~5 min. is most times simply not closing connections. And running out of
file handles can affect all file creation on your system (e.g. creating BDB's transaction
log files).
Only the original poster can find out with monitoring.
One can find out stale connections via back-monitor in sub-tree
cn=Connections,cn=Monitor. IITC attribute 'monitorConnectionActivityTime' contains last
client access time on this connection.
(Ummh, I have to add this to my own monitoring script...)
And of course normal system monitoring of file handles would be also helpful.
Ciao, Michael.
(Keep repeating this mantra: monitoring, monitoring, monitoring, monitoring…)
6 years, 5 months
Re: (ITS#8669) Slapd service becomes unresponsive intermittently
by quanah@symas.com
--On Wednesday, June 07, 2017 7:07 AM -0600 Joaquin Estrada=20
<jmestrada69(a)gmail.com> wrote:
> Quanah,
>
> We are running the Berkley DB back-end, =C2=B3back-bdb=C2=B2 in the =
slapd.conf
> file.
>
> Our server vendor did the upgrade to version 2.4.39 last year in April. =
In
> asking them about upgrading to a newer version, as a potential fix, I was
> told the last version in the RHEL repository that they can upgrade to is
> 2.4.40. I=C2=B9m not certain that our vendor will support our choice to
> upgrade to a newer version than what RHEL provides them in the
> repository, but if it will fix our problem, I=C2=B9ll have to push the
> envelope on that matter.
>
> Were there any known fragmentation issues with back-bdb in the 2.4.39
> version that could also be causing these pauses?
No, back-bdb is not remotely the same as back-mdb. However, I've no idea=20
what options RedHat compiles their BDB library with and there were specific =
options that had an effect on OpenLDAP. Generally, I would note that the=20
back-bdb backend and back-hdb backends are deprecated at this point.
> Is it possible we have the idletimeout set too high and it should be
> lowered? I=C2=B9m wondering if there is some sweet-spot value for this
> particular setting.
I generally leave it unset unless one is encountering an issue of running=20
out of connections. Generally, it would be fairly strange for idletimeout=20
to affect things this way at all. It simply drops idle connections based=20
off of the timer. Disabling rate throttling in rsyslogd is a good idea,=20
but may be unrelated as well. We've also seen cases with RHEL7 where Redhat =
has set things up so that journald also gets all the syslog messages, which =
causes severe performance degredation.
You could spend some time seeing if you can isolate an exact cause. For=20
example, set loglevel to 0 and see if you still encounter the issue. If=20
you do, it is unrelated to syslog activity.
Another test would be to set idletimeout to 0. If you still encounter the=20
issue, it is unrelated to idle connections being dropped. etc.
As Michael noted, Redhat builds are somewhat questionable as they make=20
various changes to the code base that the OpenLDAP project have not been=20
reviewed. Your issues may or may not be related to such a change, it's=20
generally impossible to know.
Hope that helps.
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
6 years, 5 months
Re: ITS#8504 fix breaks Solaris builds
by quanah@symas.com
--On Wednesday, June 07, 2017 3:35 PM +0200 Hallvard Breien Furuseth
<h.b.furuseth(a)usit.uio.no> wrote:
> On 06. juni 2017 21:59, quanah(a)openldap.org wrote:
>> /usr/include/signal.h:233:12: note: declared here
>> extern int sigwait(sigset_t *);
>> ^
>
> POSIX says it has two arguments:
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/sigwait.html
> Is there a two-argument version of sigwait in the Solaris headers
> somewhere,
> maybe enabled by #define POSIX_C_SOURCE 1 or something like that?
It's insanely long....
/*
* sigwait() prototype is defined here.
*/
#if defined(__EXTENSIONS__) || (!defined(_STRICT_STDC) && \
!defined(__XOPEN_OR_POSIX)) || (_POSIX_C_SOURCE - 0 >= 199506L) || \
defined(_POSIX_PTHREAD_SEMANTICS)
#if defined(__STDC__)
#if (_POSIX_C_SOURCE - 0 >= 199506L) ||
defined(_POSIX_PTHREAD_SEMANTICS)
#ifdef __PRAGMA_REDEFINE_EXTNAME
#pragma redefine_extname sigwait __posix_sigwait
extern int sigwait(const sigset_t *_RESTRICT_KYWD, int *_RESTRICT_KYWD);
#else /* __PRAGMA_REDEFINE_EXTNAME */
extern int __posix_sigwait(const sigset_t *_RESTRICT_KYWD,
int *_RESTRICT_KYWD);
#ifdef __lint
#define sigwait __posix_sigwait
#else /* !__lint */
static int
sigwait(const sigset_t *_RESTRICT_KYWD __setp, int *_RESTRICT_KYWD __signo)
{
return (__posix_sigwait(__setp, __signo));
}
#endif /* !__lint */
#endif /* __PRAGMA_REDEFINE_EXTNAME */
#else /* (_POSIX_C_SOURCE - 0 >= 199506L) || ... */
extern int sigwait(sigset_t *);
#endif /* (_POSIX_C_SOURCE - 0 >= 199506L) || ... */
#else /* __STDC__ */
#if (_POSIX_C_SOURCE - 0 >= 199506L) ||
defined(_POSIX_PTHREAD_SEMANTICS)
#ifdef __PRAGMA_REDEFINE_EXTNAME
#pragma redefine_extname sigwait __posix_sigwait
extern int sigwait();
#else /* __PRAGMA_REDEFINE_EXTNAME */
extern int __posix_sigwait();
#ifdef __lint
#define sigwait __posix_sigwait
#else /* !__lint */
static int
sigwait(__setp, __signo)
sigset_t *__setp;
int *__signo;
{
return (__posix_sigwait(__setp, __signo));
}
#endif /* !__lint */
#endif /* __PRAGMA_REDEFINE_EXTNAME */
#else /* (_POSIX_C_SOURCE - 0 >= 199506L) || ... */
extern int sigwait();
#endif /* (_POSIX_C_SOURCE - 0 >= 199506L) || ... */
#endif /* __STDC__ */
#endif /* defined(__EXTENSIONS__) || (!defined(_STRICT_STDC) ... */
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
6 years, 5 months
Re: (ITS#8669) Slapd service becomes unresponsive intermittently
by jmestrada69@gmail.com
Yes, I've reached out to our vendor about this. I am hoping we can sidestep t=
he RHEL releases. Thanks for the info on this.=20
Sent from my iPhone
> On Jun 7, 2017, at 6:16 AM, Michael Str=C3=B6der <michael(a)stroeder.com> wr=
ote:
>=20
> jmestrada69(a)gmail.com wrote:
>> Our server vendor did the upgrade to version 2.4.39 last year in April. I=
n
>> asking them about upgrading to a newer version, as a potential fix, I was=
>> told the last version in the RHEL repository that they can upgrade to is
>> 2.4.40.
>=20
> They seem to just recommend what seems to be the easiest choice for them a=
nd not what
> would be the recommended choice for *you*. RHEL packages are heavily patch=
ed by Red Hat
> and generally not recommended. The upstream developers cannot oversee what=
's the current
> patch state of RHEL packages.
>=20
> =3D> You should kick out your server vendor from doing the OpenLDAP suppor=
t.
>=20
> Ciao, Michael.
6 years, 5 months