--On Friday, May 15, 2015 7:53 PM +0000 quanah@zimbra.com wrote:
Capturing what was discussed of this issue in IRC at the time:
<hyc> too bad. yes, there is a case where a refresh can delete entries that shouldn't have been deleted <hyc> we discussed it here a month or so ago <hyc> but if all servers were in persist phase with delta, that shouldn't happen <JoBbZ> hm.. should have been out of refresh <JoBbZ> it did a refresh @ zimbra.log.30:May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> but that was hours before <hyc> and it was also behind by 1.5 hours <hyc> ... <JoBbZ> is there a message logged when refresh mode ends? <hyc> that's the message. <hyc> at least, that is the message from the provider that it's done sending refresh messages <hyc> then the consumer runs syncrepl_del_nonpresent based on the provider messages <JoBbZ> zimbra.log.30:May 1 18:37:33 zm-mmr01 slapd[43243]: syncrepl_del_nonpresent: rid=102 be_delete uid=skoelzer,ou=people,dc=unitedfiber,dc=email (0) <JoBbZ> to <JoBbZ> zimbra.log.30:May 1 18:37:35 zm-mmr01 slapd[43243]: syncrepl_del_nonpresent: rid=101 be_delete uid=fsodell2,ou=people,dc=ntin,dc=net (32) <JoBbZ> why would refresh mode end @ 14:146, and then 4 hours later, it does del_nonpresent? <hyc> it might have had a lot of refresh messages to parse <hyc> but 4 hours seems excessive, yeah <JoBbZ> but it was replicating during that time too <JoBbZ> May 1 15:00:00 zm-mmr01 slapd[43243]: syncrepl_message_to_op: rid=102 be_modify uid=gibbonsj,ou=people,dc=bevcomm,dc=net (0) <JoBbZ> for example <JoBbZ> but thousands and thousands <hyc> that's rid=102 <hyc> you were looking at rid=101 <JoBbZ> May 1 18:42:18 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> May 1 18:42:20 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> they both finished around the same time <hyc> this is 4way so where's the 3rd rid <JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> hm <JoBbZ> this is interesting.. <JoBbZ> May 1 14:18:45 zm-mmr01 slapd[36133]: do_syncrep2: rid=100 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> May 1 14:19:59 zm-mmr01 slapd[36133]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> May 1 14:19:59 zm-mmr01 slapd[36133]: do_syncrep2: rid=102 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> so it did one sync then <JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> May 1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> another sync then <JoBbZ> a sync here <JoBbZ> May 1 18:37:34 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> May 1 18:37:35 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> and one at <JoBbZ> May 1 18:42:18 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> May 1 18:42:20 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE <JoBbZ> ugh <hyc> no rid=100 from the last 2? <JoBbZ> nope <JoBbZ> let me go look at the rid assignments rq <hyc> that's suspicious <JoBbZ> hm.. i bet it is losing connections <JoBbZ> 101/102 are in nashville <JoBbZ> 100 is in greenville, along with this server <JoBbZ> yeah, full fallback triggered at 18:37:04 <JoBbZ> sigh <JoBbZ> but this was a modrdn and an account add <JoBbZ> no deletes <hyc> was the Add the missing entry? <JoBbZ> yeah <hyc> yeah, that's the known issue then <JoBbZ> ok, what about the modrdn? <hyc> which isn't in an ITS so we should probably use the one you just filed <hyc> haven't thought about the modrdn but it's probably the same thing. <JoBbZ> ok <hyc> in refresh without syncprov sessionlog, the provider has to send the consumer a list of entryUUIDs of every entry that exists on the provider <hyc> the consumer then searches every entry in the consumer, compares to the list of entryUUIDs from the provider, and deletes any entry that's not in the provider's list <hyc> the problem is that an entry can be missing from this particular provider's list because its update was coming from a different provider <hyc> so it gets deleted even though it was properly received from the other provider <JoBbZ> ok, makes sense <hyc> if you have a syncprov sessionlog that's large enough to cover the span of disconnected time <hyc> then the provider sends a list of deleted entryUUIDs instead of present UUIDs <hyc> and this bug can't happen then
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com