Syncrepl getting stuck on partial sync

List overview All Threads
Download

newer

older

certutil - nsldap + nss trustargs

FW: [External] RE: Help Openldap...

jose Pereira

18 Jul 2024 18 Jul '24

8:17 a.m.

Hello All!

I have multiple setups with a Provider(2.6.4) and consumer(2.6.2) with RefreshAndPersist.

From time to time I see data missing in the consumer e.g only groups are present and all users are missing, even though users were never removed from the Provider, and were present in the Consumer beforehand. On the Provider all data is present and correct. I can still see the incoming connection, from the Consumers Sync, in the Provider without errors. No errors on the Comsumer logs from syncrepl (with -256 logs). Attempted to look in debug (-1 logs), but nothing caught my eye.

When this happened I attempted restart services on both the Provider and the Consumer, disabling and re-enabling the sync on the Consumer. Only clearing the Consumers database seemed to fix the issue. There maybe be sporatic failures in the network connectivity between both systems.

This is the config on the Consumer

olcSyncrepl: {0}rid=000 provider=ldaps://Provider:636 type=refreshAndPersist retry="30 5 300 +" searchbase="ou=subtree,dc=test,dc=net" scope=sub schemachecking=on binddn="cn=Manager,dc=test,dc=net" credentials=credentials keepalive=240:10:30

Any advice on how to troubleshoot this or fix it?

Attachments:

attachment.html (text/html — 5.6 KB)

Show replies by date

Quanah Gibson-Mount

18 Jul 18 Jul

8:37 a.m.

--On Thursday, July 18, 2024 4:17 PM +0000 jose Pereira xicop_86@hotmail.com wrote:

...

When this happened I attempted restart services on both the Provider and the Consumer, disabling and re-enabling the sync on the Consumer. Only clearing the Consumers database seemed to fix the issue. There maybe be sporatic failures in the network connectivity between both systems.

This is the config on the Consumer

olcSyncrepl: {0}rid=000 provider=ldaps://Provider:636 type=refreshAndPersist retry="30 5 300 +" searchbase="ou=subtree,dc=test,dc=net" scope=sub schemachecking=on binddn="cn=Manager,dc=test,dc=net" credentials=credentials keepalive=240:10:30

Any advice on how to troubleshoot this or fix it?

Hi Jose,

What is the syncprov configuration on the provider? For logging you want "stats" and "sync". Is this setup doing partial replication (I.e., ou=subtree is just part of the full DIT under dc=test,dc=net?)

Regards, Quanah

xicop_86＠hotmail.com

8:53 a.m.

Hello Quanah! Somehow never noticed there was specific logging for sync, thanks for the tip!

It is a partial replication, the Provider has multiple subtrees that contain OU's for Users and Groups. The consumer is only replicating one of those subtrees.

As for the syncprov config this is all I have:

dn: olcOverlay={0}syncprov objectClass: olcOverlayConfig objectClass: olcSyncProvConfig olcOverlay: {0}syncprov olcSpSessionlog: 100 structuralObjectClass: olcSyncProvConfig entryUUID: dd535138-d230-103d-8e6e-557f36e932d7 creatorsName: cn=config createTimestamp: 20230818163436Z entryCSN: 20230818163436.783818Z#000000#000#000000 modifiersName: cn=config modifyTimestamp: 20230818163436Z

Regards, José Pereira

Quanah Gibson-Mount

19 Jul 19 Jul

3:53 p.m.

--On Thursday, July 18, 2024 4:53 PM +0000 xicop_86@hotmail.com wrote:

...

Hello Quanah! Somehow never noticed there was specific logging for sync, thanks for the tip!

It is a partial replication, the Provider has multiple subtrees that contain OU's for Users and Groups. The consumer is only replicating one of those subtrees.

As for the syncprov config this is all I have:

dn: olcOverlay={0}syncprov objectClass: olcOverlayConfig objectClass: olcSyncProvConfig olcOverlay: {0}syncprov olcSpSessionlog: 100

I don't know how many entries are in your database, the sesionlog may be a bit small. Generally what you want to do is look at what the logs show is happening in this scenario once you have sync logging enabled.

--Quanah

xicop_86＠hotmail.com

24 Jul 24 Jul

4:41 a.m.

Hello! Finally managed to find the issue again, and enabled sync logging, this is what I could see:

slapd starting do_syncrep1: rid=000 starting refresh (sending cookie=rid=000,csn=20240724113017.421010Z#000000#000#000000) do_syncrep2: rid=000 LDAP_RES_INTERMEDIATE - REFRESH_DELETE

I am unsure of what it means.

As for the size of the database shouldn't be particularly big, maybe 50 users and 50 groups. I will try to increase the sessionlog as mentioned, but any further advice is very welcome!

BR, José Pereira

Quanah Gibson-Mount

10:49 a.m.

--On Wednesday, July 24, 2024 12:41 PM +0000 xicop_86@hotmail.com wrote:

...

Hello! Finally managed to find the issue again, and enabled sync logging, this is what I could see:

slapd starting do_syncrep1: rid=000 starting refresh (sending cookie=rid=000,csn=20240724113017.421010Z#000000#000#000000) do_syncrep2: rid=000 LDAP_RES_INTERMEDIATE - REFRESH_DELETE

I am unsure of what it means.

As for the size of the database shouldn't be particularly big, maybe 50 users and 50 groups. I will try to increase the sessionlog as mentioned, but any further advice is very welcome!

Sync logging has to be enabled from the start (I'd generally advise it always being enabled). I can't tell from your first sentence if you're saying you enabled it after the issue happened.

There should certainly be more than a single line item in regards to sync replication as well.

--Quanah

xicop_86＠hotmail.com

12:28 p.m.

I did only enable it after the issue happened. Also increased the sessionlog on the provider at the same time, not sure if that is a lot or little, but the provider will be managing a lot of info and clients.

I could now actually see more sync entries in the logs.

Firstly the entry I shared before and then hundreds of the following:

do_syncrep2: rid=000 NEW_COOKIE: rid=000,csn=20240724114028.866106Z#000000#000#000000 Jul 24 11:40:28 Consumer slapd[1466367]: slap_queue_csn: queueing 0x7fe8f0109f10 20240724114028.866106Z#000000#000#000000 Jul 24 11:40:28 Consumer slapd[1466367]: slap_graduate_commit_csn: removing 0x7fe8f0109f10 20240724114028.866106Z#000000#000#000000

Followed by hundreds more of the following, but seems to be only listing the groups, not the ou=users. Also found it a bit weird, there is only a single user that was synced (or was not removed as the others), but no sync log about him.

do_syncrep2: rid=000 cookie=rid=000,csn=20240724114029.713635Z#000000#000#000000 Jul 24 11:40:29 Consumer slapd[1466367]: syncrepl_message_to_entry: rid=000 DN: cn=group,ou=Groups,dc=test=net, UUID: 75407828-d95f-103e-8539-f354611c95ec Jul 24 11:40:29 Consumer slapd[1466367]: syncrepl_entry: rid=000 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20240724114029.713635Z#000000#000#000000 tid 0x7fe8fa7fe640 Jul 24 11:40:29 Consumer slapd[1466367]: syncrepl_entry: rid=000 be_search (0) Jul 24 11:40:29 Consumer slapd[1466367]: syncrepl_entry: rid=000 cn=group,ou=Groups,dc=test=net Jul 24 11:40:29 Consumer slapd[1466367]: slap_queue_csn: queueing 0x7fe8ec11f4e0 20240724114029.713635Z#000000#000#000000

And after those, hundreds more of these:

do_syncrep2: rid=000 LDAP_RES_INTERMEDIATE - NEW_COOKIE Jul 24 11:40:31 Consumer slapd[1466367]: do_syncrep2: rid=000 NEW_COOKIE: rid=000,csn=20240724114030.988885Z#000000#000#000000 Jul 24 11:40:31 Consumer slapd[1466367]: slap_queue_csn: queueing 0x7fe8ec108eb0 20240724114030.988885Z#000000#000#000000

It is doing this every 20 minutes, with some do_syncrepl: rid=000 rc -1 retrying (2 retries left) in between, Network not too stable, but I saw no error messages or even warnings.

Hopefully this info helps more.

BR, José Pereira

xicop_86＠hotmail.com

30 Jul 30 Jul

12:37 a.m.

Hello Again Quanah, Having this happen more and more, any advice? I was unable to understand any issue from the logs.

BR, José Pereira

xicop_86＠hotmail.com

1:04 a.m.

xicop_86＠hotmail.com wrote:

...

Hello Again Quanah, Having this happen more and more, any advice? I was unable to understand any issue from the logs. BR, José Pereira

Also I was able to capture what I believe is the failure point, these entry seem to exist for all the missing entries:

Jul 30 01:09:17 Consumer slapd[3126]: nonpresent_callback: rid=000 nonpresent UUID fc7ddb2c-c425-103e-8ad3-39e85096018c, dn cn=Jose,ou=Users,ou=Portugal,dc=test,dc=net Jul 30 01:09:17 Consumer slapd[3126]: nonpresent_callback: rid=000 adding entry cn=Jose,ou=Users,ou=Portugal,dc=test,dc=net to non-present list

nonpresent_callback: rid=000 had 56 items left in the list

Jul 30 01:09:17 Consumer slapd[3126]: syncrepl_del_nonpresent: rid=000 be_delete cn=Jose,ou=Users,ou=Portugal,dc=test,dc=net (0) Jul 30 01:09:17 Consumer slapd[3126]: syncrepl_del_nonpresent: rid=000 be_delete cn=test,ou=Groups,ou=Portugal,dc=test,dc=net (0)

BR, José

Ondřej Kuzník

5 Aug 5 Aug

2:25 a.m.

On Tue, Jul 30, 2024 at 08:04:33AM -0000, xicop_86@hotmail.com wrote:

...

xicop_86＠hotmail.com wrote:

...
Hello Again Quanah, Having this happen more and more, any advice? I was unable to understand any issue from the logs.

Also I was able to capture what I believe is the failure point, these entry seem to exist for all the missing entries:

Jul 30 01:09:17 Consumer slapd[3126]: nonpresent_callback: rid=000 nonpresent UUID fc7ddb2c-c425-103e-8ad3-39e85096018c, dn cn=Jose,ou=Users,ou=Portugal,dc=test,dc=net Jul 30 01:09:17 Consumer slapd[3126]: nonpresent_callback: rid=000 adding entry cn=Jose,ou=Users,ou=Portugal,dc=test,dc=net to non-present list

Hi José, the provider claims the entries in question should not be returned given the search issued so the consumer removes them from their copy. This is most often due to misconfigured ACLs.

As a start, you should be running the same search manually (with the same credentials as the consumer) to make sure the full dataset is returned. If they are missing, slapacl tool is your friend when investigating ACL issues.

Regards,

-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP

xicop_86＠hotmail.com

2 Sep 2 Sep

1:02 a.m.

Hello Ondřej! Thanks for the recommendation, Was not able to test this until now since this appears to be random and I have not been able to force the issue.

I now tested what you mentioned in a failed Consumer, and the search works fine and returns all the data, whether I do the search on the Provider or the Consumer slapacl also returns no issues.

Everytime I look at this issue I get more and more confused.

BR, José Pereira

xicop_86＠hotmail.com

1:59 a.m.

Also something I should have realized sooner that might be relevant, I have 2 Providers behind a Load Balancer for fault tolerance. Both these Providers have the same data, but I assume some "invisible" data might be different has these servers are not mirrored.

I have notice sometimes the issue correlates with the connection falling to the 2nd server, but when I tried to replicate the issue by forcing the 1st server to fail, I was not successful to replicate the issue.

BR, José Pereira

Ondřej Kuzník

3 Sep 3 Sep

1:49 a.m.

On Mon, Sep 02, 2024 at 08:59:23AM -0000, xicop_86@hotmail.com wrote:

...

Also something I should have realized sooner that might be relevant, I have 2 Providers behind a Load Balancer for fault tolerance. Both these Providers have the same data, but I assume some "invisible" data might be different has these servers are not mirrored.

I have notice sometimes the issue correlates with the connection falling to the 2nd server, but when I tried to replicate the issue by forcing the 1st server to fail, I was not successful to replicate the issue.

In that case, you probably want to gather sync level data from the providers and consumer and check if there's anything that correlates. Also for efficiency sake, maybe set up a sessionlog rather than going through a full present phase every time.

Regards,

-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP

xicop_86＠hotmail.com

2:04 a.m.

I already have sessionlog on the provider "olcSpSessionlog: 100" although I question if it is enough, as the Provider does have lots of data. As for the Sync level data, are you referring to sync logs? I shared some info on that before, but nothing stood out to me.

BR, José Pereira

Ondřej Kuzník

6:59 a.m.

On Tue, Sep 03, 2024 at 09:04:57AM -0000, xicop_86@hotmail.com wrote:

...

I already have sessionlog on the provider "olcSpSessionlog: 100" although I question if it is enough, as the Provider does have lots of data. As for the Sync level data, are you referring to sync logs? I shared some info on that before, but nothing stood out to me.

Ah, I see you did and now it clicked that you're running 2 providers but both seem to have a serverID of 0. This is guaranteed to end up in desyncs of various kinds. I suggest you assign a unique (non-zero) serverID to each provider.

Regards,

-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP

xicop_86＠hotmail.com

7:48 a.m.

Ondřej Kuzník wrote:

...

On Tue, Sep 03, 2024 at 09:04:57AM -0000, xicop_86@hotmail.com wrote:

...
I already have sessionlog on the provider "olcSpSessionlog: 100" although I question if it is enough, as the Provider does have lots of data. As for the Sync level data, are you referring to sync logs? I shared some info on that before, but nothing stood out to me. Ah, I see you did and now it clicked that you're running 2 providers but

both seem to have a serverID of 0. This is guaranteed to end up in desyncs of various kinds. I suggest you assign a unique (non-zero) serverID to each provider. Regards,

That actually makes a lot of sense! Would I then only have to set different ServerID on the providers? No change needed on the Consumers?

BR, José Pereira

Ondřej Kuzník

9:20 a.m.

On Tue, Sep 03, 2024 at 02:48:15PM -0000, xicop_86@hotmail.com wrote:

...

Ondřej Kuzník wrote:

...
Ah, I see you did and now it clicked that you're running 2 providers but both seem to have a serverID of 0. This is guaranteed to end up in desyncs of various kinds. I suggest you assign a unique (non-zero) serverID to each provider.

That actually makes a lot of sense! Would I then only have to set different ServerID on the providers? No change needed on the Consumers?

Correct, consumers do not need this setting.

-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP

xicop_86＠hotmail.com

11:53 p.m.

Ondřej Kuzník wrote:

...

On Tue, Sep 03, 2024 at 02:48:15PM -0000, xicop_86@hotmail.com wrote:

...
Ondřej Kuzník wrote: Ah, I see you did and now it clicked that you're running 2 providers but both seem to have a serverID of 0. This is guaranteed to end up in desyncs of various kinds. I suggest you assign a unique (non-zero) serverID to each provider. That actually makes a lot of sense! Would I then only have to set different ServerID on the providers? No change needed on the Consumers? Correct, consumers do not need this setting.

Thanks a lot Ondřej I will apply this change! Just for my clarification then, on the Consumer we have a setting rid=000. That is completely unrelated to the ServerID on the Provider? Just concerned if the Load balancer points to the second server and the ServerID is different on the second server, it will break the sync.

BR, José Pereira

Quanah Gibson-Mount

4 Sep 4 Sep

7:55 a.m.

--On Wednesday, September 4, 2024 7:53 AM +0000 xicop_86@hotmail.com wrote:

...

Ondřej Kuzník wrote:

...
On Tue, Sep 03, 2024 at 02:48:15PM -0000, xicop_86@hotmail.com wrote:

...
Ondřej Kuzník wrote: Ah, I see you did and now it clicked that you're running 2 providers but both seem to have a serverID of 0. This is guaranteed to end up in desyncs of various kinds. I suggest you assign a unique (non-zero) serverID to each provider. That actually makes a lot of sense! Would I then only have to set different ServerID on the providers? No change needed on the Consumers? Correct, consumers do not need this setting.

Thanks a lot Ondřej I will apply this change! Just for my clarification then, on the Consumer we have a setting rid=000. That is completely unrelated to the ServerID on the Provider? Just concerned if the Load balancer points to the second server and the ServerID is different on the second server, it will break the sync.

ServerID must be unique per provider when you have more than one provider. Load balancing is not a factor as it simply tells all sync clients (including the other provider) who originated the change.

--Quanah

300

Age (days ago)

348

Last active (days ago)

openldap-technical@openldap.org

18 comments

4 participants

tags (0)

participants (4)

jose Pereira
Ondřej Kuzník
Quanah Gibson-Mount
xicop_86＠hotmail.com