Full_Name: Maxime Besson
Version: 2.4.47
OS: Debian Jessie
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (2a01:cb00:802:8400:2cbe:3c60:fca6:e50b)

I am running a meta-directory (version 2.4.47, LTB build on Ubuntu 16.04) with the following database configuration:
dn: olcDatabase={1}meta,cn=config
objectClass: olcMetaConfig
objectClass: olcDatabaseConfig
objectClass: olcConfig
objectClass: top
olcDatabase: {1}meta
olcSuffix: dc=com
olcAccess: {0}to * by * read
olcRootDN: cn=admin,dc=com

dn: olcMetaSub={0}uri,olcDatabase={1}meta,cn=config
objectClass: olcMetaTargetConfig
objectClass: olcConfig
objectClass: top
olcMetaSub: {0}uri
olcDbURI: ldap://1.2.3.4/dc=example,dc=com
olcDbIDAssertBind: mode=legacy flags=non-prescriptive,proxy-authz-non-critical bindmethod=simple binddn="cn=admin,dc=example,dc=com" credentials="XXXXX"
olcDbTimeout: 5
olcDbNetworkTimeout: 3
olcDbNretries: never
olcDbRebindAsUser: true

...
(There are 8 backends in total)
I added the timeouts to keep OpenLDAP from blocking entirely when one server becomes unavailable. However, since I added them, the slapd process has started crashing every now and then (at intervals ranging from a couple of hours to a couple of days), usually during short network interruptions that affect all backends: I see plenty of reconnect log messages shortly before the crashes.
The crash is always immediately preceded by the following log message:
meta_search_dobind_init[{i}]: retrying URI="{url}" DN="{DN}"
{i} is never the same, and {url} and {DN} are the correct settings for backend i.
The crash itself is an ABRT at the following assert in back-meta/search.c (lines 1957-1958):

    assert( candidates[ i ].sr_msgid >= 0
        || candidates[ i ].sr_msgid == META_MSGID_CONNECTING );

I have analyzed several core dumps, and found that every single time slapd crashes, sr_msgid has a value of -1 (META_MSGID_IGNORE), which indeed causes the assert to fail.
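To make the failure condition explicit, here is a minimal standalone sketch of the assert's condition. The value -1 for META_MSGID_IGNORE is the one seen in the core dumps; the value used for META_MSGID_CONNECTING is an assumption for illustration only and should be checked against back-meta.h. The helper names are hypothetical and are not part of slapd.

```c
#include <assert.h>
#include <stdio.h>

/* Value observed in the core dumps for the crashing candidate. */
#define SKETCH_MSGID_IGNORE      (-1)
/* ASSUMPTION: placeholder value; the real one is defined in back-meta.h.
 * All that matters here is that it is negative and differs from -1. */
#define SKETCH_MSGID_CONNECTING  (-3)

/* Same condition as the assert in back-meta/search.c, as a predicate:
 * returns 1 when the assert would pass, 0 when it would abort. */
static int assert_would_pass(int sr_msgid)
{
    return sr_msgid >= 0 || sr_msgid == SKETCH_MSGID_CONNECTING;
}

/* Mirrors what slapd does: abort (SIGABRT) when the condition is false. */
static void check_candidate(int sr_msgid)
{
    assert(assert_would_pass(sr_msgid));
    printf("sr_msgid=%d accepted\n", sr_msgid);
}
```

Under these assumptions, check_candidate(0) and check_candidate(SKETCH_MSGID_CONNECTING) return normally, while check_candidate(SKETCH_MSGID_IGNORE) aborts, which matches the ABRT seen in production.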
I also found that candidates[i]->sr_flags has a value of 3 (META_CANDIDATE + META_BINDING).
The msc_mscflags values in mc->mc_conns[ i ] are:
* 0x100081 for all connections before the one that triggers the crash
* 0x100010 for the candidate that crashes the server
* 0x100080 for all connections after it
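For anyone comparing these three masks, a small sketch that breaks them into individual bits may help. It deliberately does not map bits to flag names, since those would have to be looked up in the back-meta/back-ldap headers for this exact version; the helper names below are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if the given bit (0..31) is set in mask, else 0. */
static int bit_is_set(unsigned long mask, int bit)
{
    assert(bit >= 0 && bit < 32);
    return (int)((mask >> bit) & 1UL);
}

/* Returns the bits that differ between two flag masks. */
static unsigned long differing_bits(unsigned long a, unsigned long b)
{
    return a ^ b;
}

/* Prints every set bit of mask together with its hex value. */
static void print_set_bits(const char *label, unsigned long mask)
{
    printf("%-10s 0x%06lx:", label, mask);
    for (int bit = 0; bit < 32; bit++) {
        if (bit_is_set(mask, bit)) {
            printf(" bit %d (0x%lx)", bit, 1UL << bit);
        }
    }
    putchar('\n');
}
```

Calling print_set_bits on the three values shows that bit 20 (0x100000) is set in all of them, that the connections before and after the crashing one both have bit 7 (0x80) set, that only the earlier ones also have bit 0 (0x01), and that the crashing candidate has bit 4 (0x10) set instead of either.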
I am having trouble reproducing this in a test environment, but it happens regularly in production. I have tried changing the timeouts, adding a non-default bind timeout, and disabling retries (they were originally allowed), but the crashes keep happening. Note that disabling retries (olcDbNretries: never) still seems to lead to retries in meta_search_dobind_init, since the log message is still there.
I cannot share the core dumps because of the sensitive information they contain. However, I would gladly extract more information from them if it would help solve this.