Full_Name: Maxime Besson
Version: 2.4.47
OS: Debian Jessie
URL:
ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (2a01:cb00:802:8400:2cbe:3c60:fca6:e50b)
I am running a meta-directory on OpenLDAP 2.4.47 (LTB build on Ubuntu 16.04)
with the following database configuration:
dn: olcDatabase={1}meta,cn=config
objectClass: olcMetaConfig
objectClass: olcDatabaseConfig
objectClass: olcConfig
objectClass: top
olcDatabase: {1}meta
olcSuffix: dc=com
olcAccess: {0}to * by * read
olcRootDN: cn=admin,dc=com

dn: olcMetaSub={0}uri,olcDatabase={1}meta,cn=config
objectClass: olcMetaTargetConfig
objectClass: olcConfig
objectClass: top
olcMetaSub: {0}uri
olcDbURI: ldap://1.2.3.4/dc=example,dc=com
olcDbIDAssertBind: mode=legacy flags=non-prescriptive,proxy-authz-non-critical
  bindmethod=simple binddn="cn=admin,dc=example,dc=com"
  credentials="XXXXX"
olcDbTimeout: 5
olcDbNetworkTimeout: 3
olcDbNretries: never
olcDbRebindAsUser: true
...
(There are 8 backends in total.)
The timeouts were added to keep OpenLDAP from blocking entirely when one
backend server becomes unavailable. However, since I added them, the slapd
process has started crashing at intervals ranging from a couple of hours to a
couple of days, usually during short network interruptions that affect all
backends: I see many reconnect logs shortly before each crash.
The crash is always immediately preceded by the following log message:
meta_search_dobind_init[{i}]: retrying URI="{url}" DN="{DN}"
{i} varies from crash to crash, and {url} and {DN} are the correct settings for
backend {i}.
The crash itself is an abort (SIGABRT) at the following assert in
back-meta/search.c (lines 1957-1958):
    assert( candidates[ i ].sr_msgid >= 0
        || candidates[ i ].sr_msgid == META_MSGID_CONNECTING );
I have analyzed several core dumps, and found that every single time slapd
crashes, sr_msgid has a value of -1 (META_MSGID_IGNORE), which indeed causes the
assert to fail.
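To make the failure condition concrete, here is a minimal standalone sketch that evaluates the same predicate outside slapd. META_MSGID_IGNORE is -1, as observed in the core dumps; the value -3 used for META_MSGID_CONNECTING is an assumption and should be checked against back-meta.h in the 2.4.47 tree. The helper names assert_would_pass and replay_assert are hypothetical and do not exist in OpenLDAP.

```c
#include <assert.h>
#include <stdio.h>

/* -1 is the value of sr_msgid observed in the core dumps. */
#define META_MSGID_IGNORE      (-1)
/* ASSUMED value; verify against back-meta.h before relying on it. */
#define META_MSGID_CONNECTING  (-3)

/* Hypothetical helper: same predicate as the assert in search.c. */
int assert_would_pass(int sr_msgid)
{
    return sr_msgid >= 0 || sr_msgid == META_MSGID_CONNECTING;
}

/* Hypothetical helper: re-runs the assert expression itself, so it
 * aborts for exactly the sr_msgid values that would abort slapd. */
void replay_assert(int sr_msgid)
{
    if (!assert_would_pass(sr_msgid))
        fprintf(stderr, "sr_msgid=%d: assert will fail\n", sr_msgid);
    assert(sr_msgid >= 0 || sr_msgid == META_MSGID_CONNECTING);
}
```

Under these assumptions, replay_assert(0) and replay_assert(META_MSGID_CONNECTING) return normally, while replay_assert(META_MSGID_IGNORE) aborts, which matches what the core dumps show.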
I also found that candidates[ i ].sr_flags has a value of 3 (META_CANDIDATE |
META_BINDING), and that the msc_mscflags values in mc->mc_conns[] are:
* 0x100081 for all connections before the one that triggers the crash
* 0x100010 for the candidate that crashes the server
* 0x100080 for all connections after it
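The three flag words differ only in their low byte. A small sketch such as the one below, which XORs each neighbouring value against the crashing one, shows which bits are involved. The flag values are taken from the core dumps above; the macro names FLAGS_* and the helpers flag_diff and print_diff are hypothetical. The symbolic names of the individual bits are deliberately not reproduced here and should be looked up in the OpenLDAP headers.

```c
#include <assert.h>
#include <stdio.h>

/* Values copied from the core dumps described above. */
#define FLAGS_BEFORE  0x100081u  /* connections before the crashing one */
#define FLAGS_CRASH   0x100010u  /* the candidate that triggers the assert */
#define FLAGS_AFTER   0x100080u  /* connections after the crashing one */

/* Hypothetical helper: bits that differ between two flag words. */
unsigned flag_diff(unsigned a, unsigned b)
{
    return a ^ b;
}

/* Hypothetical helper: list each differing bit and which side has it set. */
void print_diff(const char *label, unsigned other, unsigned crash)
{
    assert(label != NULL);
    unsigned diff = flag_diff(other, crash);
    printf("%s vs crashing: differing bits 0x%x\n", label, diff);
    /* Unsigned shift wraps to 0 after the top bit, which ends the loop. */
    for (unsigned bit = 1; bit != 0; bit <<= 1) {
        if (diff & bit)
            printf("  0x%08x set only in %s\n", bit,
                   (crash & bit) ? "crashing connection" : label);
    }
}
```

Calling print_diff("before", FLAGS_BEFORE, FLAGS_CRASH) and print_diff("after", FLAGS_AFTER, FLAGS_CRASH) reports that bit 0x10 is set only on the crashing connection, bit 0x80 only on its neighbours, and bit 0x01 only on the connections before it.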
I am having trouble reproducing this in a test environment, but it happens
regularly in production. I have tried changing the timeouts, adding a
non-default bind timeout, and disabling retries (they were originally allowed),
but the crashes keep happening. Note that disabling retries (olcDbNretries:
never) still seems to lead to retries in meta_search_dobind_init, since the
"retrying" log message is still emitted.
I cannot share the core dumps because they contain sensitive information.
However, I would gladly extract more information from them if it would help
solve this.