Full_Name: Quanah Gibson-Mount
Version: 2.4.44
OS: N/A
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (47.208.148.26)
Unfortunately, it is frequently the case that a consumer CSN ends up ahead of
the provider CSN. This can occur for a variety of reasons, but for databases
holding multiple CSNs from the same serverID (delta-syncrepl and subordinate
databases are two examples where this can happen), the diagnostic emitted is
somewhat lacking:

Oct 28 15:49:55 XXXXXXX slapd[6463]: conn=1819162 op=2 SEARCH RESULT tag=101 err=53 nentries=0 text=consumer state is newer than provider!

It would be exceedingly useful to have better detail than this, so we can see
the consumer CSN and the provider CSN that are being compared.
Something like:
diff --git a/servers/slapd/overlays/syncprov.c b/servers/slapd/overlays/syncprov.c
index f7e422d..c6ef4df 100644
--- a/servers/slapd/overlays/syncprov.c
+++ b/servers/slapd/overlays/syncprov.c
@@ -2654,6 +2654,11 @@ syncprov_op_search( Operation *op, SlapReply *rs )
 			/* our state is older, complain to consumer */
 			rs->sr_err = LDAP_UNWILLING_TO_PERFORM;
 			rs->sr_text = "consumer state is newer than provider!";
+
+			Log4( LDAP_DEBUG_SYNC, LDAP_LEVEL_INFO,
+				"consumer %d state %s is newer than provider %d state %s\n",
+				sids[i], srs->sr_state.ctxcsn[i].bv_val, sids[j], /* == slap_serverID */
+				ctxcsn[j].bv_val);
 bailout:
 	if ( sop ) {
 		syncops **sp = &si->si_ops;
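
With a patch along these lines, the refusal would be accompanied by a log
entry showing both CSNs being compared, something like the following
(illustrative serverIDs and CSN values only, not taken from a real log; CSNs
have the form YYYYmmddHHMMSS.ffffffZ#counter#sid#mod):

Oct 28 15:49:55 XXXXXXX slapd[6463]: consumer 3 state 20151028154955.123456Z#000000#003#000000 is newer than provider 3 state 20151028154900.654321Z#000000#003#000000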
kwetlesen(a)mac.com wrote:
> Full_Name: Kai M Wetlesen
> Version: 2.4.40
> OS: RHEL 7.2 Maipo
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (198.4.83.52)
>
> For some reason slapd crashes with a segmentation fault when subjected to a
> TLS-heavy connection load. It looks like the segmentation fault originates in
> libnss3.so,
It seems you're using the binary RPMs provided by RHEL. Those are linked
against libnss and are furthermore heavily patched by Red Hat.
Assuming you have a RHEL support contract, you had better file a RHEL bug
report first.
Ciao, Michael.
--On Friday, May 15, 2015 7:53 PM +0000 quanah(a)zimbra.com wrote:
Capturing what was discussed about this issue in IRC at the time:
<hyc> too bad. yes, there is a case where a refresh can delete entries that
shouldn't have been deleted
<hyc> we discussed it here a month or so ago
<hyc> but if all servers were in persist phase with delta, that shouldn't
happen
<JoBbZ> hm.. should have been out of refresh
<JoBbZ> it did a refresh @ zimbra.log.30:May 1 14:46:02 zm-mmr01
slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> but that was hours before
<hyc> and it was also behind by 1.5 hours
<hyc> ...
<JoBbZ> is there a message logged when refresh mode ends?
<hyc> that's the message.
<hyc> at least, that is the message from the provider that it's done
sending refresh messages
<hyc> then the consumer runs syncrepl_del_nonpresent based on the provider
messages
<JoBbZ> zimbra.log.30:May 1 18:37:33 zm-mmr01 slapd[43243]:
syncrepl_del_nonpresent: rid=102 be_delete
uid=skoelzer,ou=people,dc=unitedfiber,dc=email (0)
<JoBbZ> to
<JoBbZ> zimbra.log.30:May 1 18:37:35 zm-mmr01 slapd[43243]:
syncrepl_del_nonpresent: rid=101 be_delete
uid=fsodell2,ou=people,dc=ntin,dc=net (32)
<JoBbZ> why would refresh mode end @ 14:46, and then 4 hours later, it
does del_nonpresent?
<hyc> it might have had a lot of refresh messages to parse
<hyc> but 4 hours seems excessive, yeah
<JoBbZ> but it was replicating during that time too
<JoBbZ> May 1 15:00:00 zm-mmr01 slapd[43243]: syncrepl_message_to_op:
rid=102 be_modify uid=gibbonsj,ou=people,dc=bevcomm,dc=net (0)
<JoBbZ> for example
<JoBbZ> but thousands and thousands
<hyc> that's rid=102
<hyc> you were looking at rid=101
<JoBbZ> May 1 18:42:18 zm-mmr01 slapd[43243]: do_syncrep2: rid=102
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 18:42:20 zm-mmr01 slapd[43243]: do_syncrep2: rid=101
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> they both finished around the same time
<hyc> this is 4way so where's the 3rd rid
<JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> hm
<JoBbZ> this is interesting..
<JoBbZ> May 1 14:18:45 zm-mmr01 slapd[36133]: do_syncrep2: rid=100
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 14:19:59 zm-mmr01 slapd[36133]: do_syncrep2: rid=101
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 14:19:59 zm-mmr01 slapd[36133]: do_syncrep2: rid=102
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> so it did one sync then
<JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=100
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=102
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=101
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> another sync then
<JoBbZ> a sync here
<JoBbZ> May 1 18:37:34 zm-mmr01 slapd[43243]: do_syncrep2: rid=102
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 18:37:35 zm-mmr01 slapd[43243]: do_syncrep2: rid=101
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> and one at
<JoBbZ> May 1 18:42:18 zm-mmr01 slapd[43243]: do_syncrep2: rid=102
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May 1 18:42:20 zm-mmr01 slapd[43243]: do_syncrep2: rid=101
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> ugh
<hyc> no rid=100 from the last 2?
<JoBbZ> nope
<JoBbZ> let me go look at the rid assignments rq
<hyc> that's suspicious
<JoBbZ> hm.. i bet it is losing connections
<JoBbZ> 101/102 are in nashville
<JoBbZ> 100 is in greenville, along with this server
<JoBbZ> yeah, full fallback triggered at 18:37:04
<JoBbZ> sigh
<JoBbZ> but this was a modrdn and an account add
<JoBbZ> no deletes
<hyc> was the Add the missing entry?
<JoBbZ> yeah
<hyc> yeah, that's the known issue then
<JoBbZ> ok, what about the modrdn?
<hyc> which isn't in an ITS so we should probably use the one you just filed
<hyc> haven't thought about the modrdn but it's probably the same thing.
<JoBbZ> ok
<hyc> in refresh without syncprov sessionlog, the provider has to send the
consumer a list of entryUUIDs of every entry that exists on the provider
<hyc> the consumer then searches every entry in the consumer, compares to
the list of entryUUIDs from the provider, and deletes any entry that's not
in the provider's list
<hyc> the problem is that an entry can be missing from this particular
provider's list because its update was coming from a different provider
<hyc> so it gets deleted even though it was properly received from the
other provider
<JoBbZ> ok, makes sense
<hyc> if you have a syncprov sessionlog that's large enough to cover the
span of disconnected time
<hyc> then the provider sends a list of deleted entryUUIDs instead of
present UUIDs
<hyc> and this bug can't happen then
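
To make the failure mode concrete, here is a toy sketch in C of the
present-phase cleanup hyc describes. All names here are illustrative; the
real code is syncrepl_del_nonpresent() in servers/slapd/syncrepl.c, which
operates on entryUUID sets and calls be_delete(), not a toy linked list:

/* Toy model of the refresh "present" phase described above.
 * Illustrative only; not the actual slapd implementation. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define UUID_LEN 37	/* 36 chars + NUL */

struct entry {
	char uuid[UUID_LEN];
	struct entry *next;
};

/* true if uuid appears in the provider's "present" list */
static bool
present(const char *uuid, const char list[][UUID_LEN], size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(uuid, list[i]) == 0)
			return true;
	return false;
}

/* Delete every local entry the provider did not report as present.
 * The bug: an entry legitimately received from a *different* provider
 * may be absent from this provider's list, yet it is deleted anyway. */
static void
del_nonpresent(struct entry **head, const char list[][UUID_LEN], size_t n)
{
	struct entry **ep = head;
	while (*ep) {
		if (!present((*ep)->uuid, list, n)) {
			printf("be_delete %s\n", (*ep)->uuid);
			*ep = (*ep)->next;	/* unlink; real code frees it */
		} else {
			ep = &(*ep)->next;
		}
	}
}

int
main(void)
{
	/* consumer holds two entries; only one is in provider A's list */
	struct entry e2 = { "22222222-2222-2222-2222-222222222222", NULL };
	struct entry e1 = { "11111111-1111-1111-1111-111111111111", &e2 };
	struct entry *db = &e1;

	/* provider A's present list omits e2 (its update came from provider B) */
	const char list[][UUID_LEN] = {
		"11111111-1111-1111-1111-111111111111",
	};

	del_nonpresent(&db, list, 1);	/* wrongly deletes e2 */
	return 0;
}

With a sufficiently large sessionlog configured on the provider (the
syncprov-sessionlog directive of slapo-syncprov), the provider sends the list
of deleted entryUUIDs instead of present ones, so the consumer never has to
infer deletions from absence and this mis-delete cannot occur.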
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
kwetlesen(a)mac.com wrote:
> Full_Name: Kai M Wetlesen
> Version: 2.4.40
> OS: RHEL 7.2 Maipo
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (198.4.83.52)
>
>
> Hi All,
>
> For some reason slapd crashes with a segmentation fault when subjected to a
> TLS-heavy connection load. It looks like the segmentation fault originates in
> libnss3.so, but I cannot tell anything more than that.
Issues with OpenLDAP builds shipped in RHEL must be directed to Red Hat
support. The OpenLDAP Project does not endorse the use of libnss.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Full_Name: Kai M Wetlesen
Version: 2.4.40
OS: RHEL 7.2 Maipo
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (198.4.83.52)
Hi All,
For some reason slapd crashes with a segmentation fault when subjected to a
TLS-heavy connection load. It looks like the segmentation fault originates in
libnss3.so, but I cannot tell anything more than that.
Oct 28 11:48:57 ldap-primary.domain slapd[31842]: conn=2259 fd=26 closed (TLS negotiation failure)
Oct 28 11:48:57 ldap-primary.domain slapd[31842]: conn=2260 fd=26 ACCEPT from IP=XXX.XXX.205.98:49696 (IP=0.0.0.0:636)

(about 800 more lines like this)

Oct 28 11:53:51 ldap-primary.domain slapd[31842]: conn=2671 fd=33 ACCEPT from IP=XXX.XXX.205.98:60921 (IP=0.0.0.0:636)
Oct 28 11:53:51 ldap-primary.domain slapd[31842]: conn=2670 fd=34 closed (TLS negotiation failure)
Oct 28 11:53:51 ldap-primary.domain slapd[31842]: conn=2672 fd=34 ACCEPT from IP=XXX.XXX.205.98:60926 (IP=0.0.0.0:636)
Oct 28 11:53:51 ldap-primary.domain slapd[31842]: conn=2672 fd=34 closed (TLS negotiation failure)
Oct 28 11:53:51 ldap-primary.domain kernel: slapd[32180]: segfault at 10 ip 00007f83554fcc65 sp 00007f83367fc550 error 4 in libnss3.so[7f83554b6000+11e000]
Oct 28 11:53:51 ldap-primary.domain systemd[1]: slapd.service: main process exited, code=killed, status=11/SEGV
The traffic originates from a penetration-test machine running Nessus, which
is used as part of a security sweep at the site where this server resides.
Unfortunately I don't have visibility into exactly which tests the Nessus
server performs, but I do know that it probes the server a couple hundred
times to try to discover what service is running. slapd is running on a
lightly configured but dedicated VM, as this server was never expected to
serve more than 400 clients.
Is this expected behavior?
Steps to reproduce:
- Install OpenLDAP
- Configure any DIT
- Configure OpenLDAP only to service ldaps:// using TLS
- Start the server
- Configure the Nessus scanner
- Run a Nessus vulnerability scan against the server (a rough stand-in for this load is sketched below)
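
For anyone without a Nessus license, a rough stand-in for this connection
pattern is the small C program below. It simply opens many TCP connections to
the ldaps:// listener and abandons them mid-negotiation; the host, port, and
iteration count are placeholders, and this is not a reproduction of Nessus's
actual probes:

/* Rough stand-in for the scanner load described above: open many TCP
 * connections to the ldaps:// listener and abandon them mid-handshake.
 * Host, port, and count are placeholders; NOT what Nessus actually sends. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sa;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(636);			/* ldaps:// */
	inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);	/* placeholder host */

	for (int i = 0; i < 800; i++) {		/* ~800 probes, as in the log */
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		if (fd < 0) {
			perror("socket");
			return 1;
		}
		if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0)
			(void)write(fd, "\x16\x03\x01", 3);	/* truncated TLS record header */
		close(fd);				/* abort the negotiation */
	}
	return 0;
}

Each aborted connection should produce a "closed (TLS negotiation failure)"
entry similar to the log excerpt above.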
Thanks,
Kai Wetlesen