I had a medium-size post describing some problems I'm having with an N-Way setup with 2.4.28, but I saw a post from Quanah that sent me in a new direction so I'm doing some more testing before whining about *that* problem...
But meanwhile... can anyone tell me if seeing errors like the following is normal when replicating cn=config?
On the provider:
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1027 fd=26 ACCEPT from IP=172.30.96.203:55371 (IP=172.30.96.202:389)
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=0 BIND dn="cn=config" mech=SIMPLE ssf=0
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=0 RESULT tag=97 err=0 text=
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=1 DISCONNECT tag=101 err=2 text=controls require LDAPv3
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=1 do_search: get_ctrls failed
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 fd=27 closed (operations error)
... and on the consumer:
Nov 30 14:41:22 kil-ds-4 slapd[8178]: do_syncrep2: rid=001 LDAP_RES_SEARCH_RESULT (2) Protocol error
Nov 30 14:41:22 kil-ds-4 slapd[8178]: do_syncrep2: rid=001 (2) Protocol error
Nov 30 14:41:22 kil-ds-4 slapd[8178]: do_syncrepl: rid=001 rc -2 retrying (3 retries left)
Brandon Hume wrote:
I had a medium-size post describing some problems I'm having with an N-Way setup with 2.4.28, but I saw a post from Quanah that sent me in a new direction so I'm doing some more testing before whining about *that* problem...
But meanwhile... can anyone tell me if seeing errors like the following is normal when replicating cn=config?
No. Errors are by definition not normal.
The test suite tests these types of replication setups. Does "make test" pass on your build?
On the provider:
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1027 fd=26 ACCEPT from IP=172.30.96.203:55371 (IP=172.30.96.202:389)
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=0 BIND dn="cn=config" mech=SIMPLE ssf=0
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=0 RESULT tag=97 err=0 text=
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=1 DISCONNECT tag=101 err=2 text=controls require LDAPv3
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=1 do_search: get_ctrls failed
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 fd=27 closed (operations error)
... and on the consumer:
Nov 30 14:41:22 kil-ds-4 slapd[8178]: do_syncrep2: rid=001 LDAP_RES_SEARCH_RESULT (2) Protocol error
Nov 30 14:41:22 kil-ds-4 slapd[8178]: do_syncrep2: rid=001 (2) Protocol error
Nov 30 14:41:22 kil-ds-4 slapd[8178]: do_syncrepl: rid=001 rc -2 retrying (3 retries left)
On Mon, Dec 12, 2011 at 10:26:16AM -0800, Howard Chu wrote:
Brandon Hume wrote:
I had a medium-size post describing some problems I'm having with an N-Way setup with 2.4.28, but I saw a post from Quanah that sent me in a new direction so I'm doing some more testing before whining about *that* problem...
But meanwhile... can anyone tell me if seeing errors like the following is normal when replicating cn=config?
No. Errors are by definition not normal.
An intermittent "will not perform" error on multimaster replication can be normal. (I grant I could be mistaken about something.) Consider, for MMR hosts a and b:
ldapmodify on a
a tries to replicate from b
a is newer than b and will not replicate older data
a grumbles about "will not perform"
b replicates from a, gets latest changes
a replicates from b, no changes to replicate
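For illustration only (this is a toy, not slapd's actual conflict-resolution code; the helper name is made up): roughly speaking, the fixed-width CSN format means plain string comparison gives chronological order, which is enough to show why a host already holding newer data refuses an older change:

/* Toy illustration, not OpenLDAP source.  CSNs such as
 * "20111212183000.123456Z#000000#001#000000" are fixed-width, so a
 * plain string comparison orders them in time. */
#include <stdio.h>
#include <string.h>

/* hypothetical helper: apply an incoming change only if it is newer */
static int should_apply(const char *local_csn, const char *incoming_csn)
{
    return strcmp(incoming_csn, local_csn) > 0;
}

int main(void)
{
    const char *local    = "20111212184500.000001Z#000000#001#000000";
    const char *incoming = "20111212183000.000001Z#000000#002#000000";

    if (!should_apply(local, incoming))
        printf("incoming change is older than local state: refuse (\"will not perform\")\n");
    return 0;
}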
The test suite tests these types of replication setups. Does "make test" pass on your build?
On the provider:
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1027 fd=26 ACCEPT from IP=172.30.96.203:55371 (IP=172.30.96.202:389)
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=0 BIND dn="cn=config" mech=SIMPLE ssf=0
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=0 RESULT tag=97 err=0 text=
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=1 DISCONNECT tag=101 err=2 text=controls require LDAPv3
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 op=1 do_search: get_ctrls failed
Nov 30 14:41:22 kil-ds-3 slapd[7540]: conn=1026 fd=27 closed (operations error)
... and on the consumer:
Nov 30 14:41:22 kil-ds-4 slapd[8178]: do_syncrep2: rid=001 LDAP_RES_SEARCH_RESULT (2) Protocol error
Nov 30 14:41:22 kil-ds-4 slapd[8178]: do_syncrep2: rid=001 (2) Protocol error
Nov 30 14:41:22 kil-ds-4 slapd[8178]: do_syncrepl: rid=001 rc -2 retrying (3 retries left)
--
Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
On 12/12/11 02:26 PM, Howard Chu wrote:
But meanwhile... can anyone tell me if seeing errors like the following is normal when replicating cn=config?
No. Errors are by definition not normal.
That's good to establish; other projects sometimes disagree. :)
The test suite tests these types of replication setups. Does "make test" pass on your build?
With flying colours. I'm inserting Debug() statements all over the place to figure out where the "downgrade" happens, since gdb apparently affects things enough to make the issue more miss than hit. As near as I can tell, the Operation structure is coming out of slap_op_alloc() with op->o_hdr->oh_protocol already set to 2 when do_search() is called.
Can you confirm whether Operation structures are meant to be recycled?
To explain, these servers are being monitored by Nagios, which does a simple bind and search every five minutes. It *only* uses LDAPv2 (I didn't write the test; I think it came with Nagios).
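For illustration (this is not the actual Nagios plugin; the host, DN and filter below are placeholders), a v2 bind-and-search probe written against libldap looks roughly like this:

/* Illustrative only: a minimal LDAPv2 probe using libldap.
 * Not the real Nagios check; hostname, base DN and filter are made up. */
#define LDAP_DEPRECATED 1          /* for ldap_simple_bind_s() */
#include <stdio.h>
#include <ldap.h>

int main(void)
{
    LDAP *ld = NULL;
    LDAPMessage *res = NULL;
    int version = LDAP_VERSION2;   /* the probe insists on LDAPv2 */
    int rc;

    if (ldap_initialize(&ld, "ldap://kil-ds-3.example.com") != LDAP_SUCCESS)
        return 1;

    /* This is what makes the server see a v2 connection. */
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);

    rc = ldap_simple_bind_s(ld, NULL, NULL);          /* anonymous bind */
    if (rc == LDAP_SUCCESS)
        rc = ldap_search_ext_s(ld, "dc=example,dc=com", LDAP_SCOPE_BASE,
                               "(objectClass=*)", NULL, 0,
                               NULL, NULL, NULL, 0, &res);

    printf("probe result: %s\n", ldap_err2string(rc));
    if (res)
        ldap_msgfree(res);
    ldap_unbind_ext_s(ld, NULL, NULL);
    return rc == LDAP_SUCCESS ? 0 : 2;
}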
I'm only going by the pointer, but it seems like the Operation structure gets recycled between these LDAPv2 connections and my LDAPv3 syncrepl query, and the protocol value is carried over. Then things explode. I've found code that initializes oh_protocol if the value isn't set, but nothing if it already has a "valid" value.
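To make the suspicion concrete, here's a deliberately dumbed-down toy (not slapd source; all names are made up) of how a pooled operation structure can leak the previous connection's protocol version when nothing resets the field on reuse:

/* Toy model of the suspected bug: a pooled "operation" struct whose
 * protocol field survives recycling.  Hypothetical names throughout. */
#include <stdio.h>
#include <string.h>

struct toy_op {
    int  protocol;          /* 2 = LDAPv2, 3 = LDAPv3, 0 = unset */
    struct toy_op *next;    /* freelist link */
};

static struct toy_op *freelist;

static struct toy_op *op_alloc(void)
{
    struct toy_op *op = freelist;
    if (op) {
        freelist = op->next;
        /* Bug being illustrated: no reset here, so op->protocol still
         * holds whatever the previous connection negotiated. */
        return op;
    }
    static struct toy_op pool[4];
    static int used;
    op = &pool[used++];
    memset(op, 0, sizeof(*op));
    return op;
}

static void op_free(struct toy_op *op)
{
    op->next = freelist;
    freelist = op;
}

int main(void)
{
    struct toy_op *v2 = op_alloc();
    v2->protocol = 2;                 /* the LDAPv2 monitoring bind */
    op_free(v2);

    struct toy_op *sync = op_alloc(); /* reused for the syncrepl search */
    if (sync->protocol && sync->protocol < 3)
        printf("stale protocol %d carried over: controls require LDAPv3\n",
               sync->protocol);
    return 0;
}

If something like that is happening in slapd, the stale "2" would be exactly what do_search() then rejects with "controls require LDAPv3".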
So I'm trying to figure out whether: a) I'm getting the wrong Op structure, one belonging to a different connection; b) I'm getting a recycled Op structure that isn't cleaned up properly; or c) there's some internal memory corruption happening, possibly a bug in Linux or VMware.
--On Tuesday, December 13, 2011 10:58 AM -0400 Brandon Hume hume-ol@bofh.ca wrote:
So I'm trying to figure out whether: a) I'm getting the wrong Op structure, one belonging to a different connection; b) I'm getting a recycled Op structure that isn't cleaned up properly; or c) there's some internal memory corruption happening, possibly a bug in Linux or VMware.
Ops are recycled; it is probably an initialization error. Please file an ITS.
--Quanah
--
Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
--On Tuesday, December 13, 2011 3:52 PM -0800 Quanah Gibson-Mount quanah@zimbra.com wrote:
--On Tuesday, December 13, 2011 10:58 AM -0400 Brandon Hume hume-ol@bofh.ca wrote:
So I'm trying to figure out whether: a) I'm getting the wrong Op structure, one belonging to a different connection; b) I'm getting a recycled Op structure that isn't cleaned up properly; or c) there's some internal memory corruption happening, possibly a bug in Linux or VMware.
Ops are recycled; it is probably an initialization error. Please file an ITS.
Never mind, I see you already filed ITS#7107. There should be a fix checked in for this in the next day or so.
--Quanah
--
Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration