Hi!
When upgrading from OpenLDAP 2.4 (SLES12) to OpenLDAP 2.5 (SLES15) I gave delta-syncrepl a try. That was a hard way in several aspects. Meanwhile I think I understand most of the details (docs could be much better IMHO).
Where delta-syncrepl has big problems is when sync has been set up, but one database is reloaded and some UUIDs are newly created for entries that exist on the other server(s). Somehow slapd detects that problem and claims that a “content sync” is required, but after some time it seems to start a refresh anyway. When I did the content load on the other server, slapd quit with a core dump. Unfortunately I had quite a lot of core dumps during my testing. So I feel delta-syncrepl is not as solid as it should be (in the version provided with SLES15 SP6).
Some of the logs:

May 27 13:37:05 v06 slapd[27194]: do_syncrep1: rid=105 starting refresh (sending cookie=rid=105,sid=006,csn=20130708113956.028116Z#000000#000#000000;20240708090039.074592Z#000000#001#0>
May 27 13:37:05 v06 slapd[27194]: do_syncrep2: rid=105 LDAP_RES_SEARCH_RESULT
May 27 13:37:05 v06 slapd[27194]: do_syncrep2: rid=105 delta-sync lost sync, switching to REFRESH
May 27 13:37:05 v06 slapd[27194]: do_syncrep2: rid=105 (4096) Content Sync Refresh Required
May 27 13:41:36 v06 slapd[27194]: do_syncrep1: rid=105 starting refresh (sending cookie=rid=105,sid=006,csn=20130708113956.028116Z#000000#000#000000;20240708090039.074592Z#000000#001#0>
May 27 13:41:36 v06 slapd[27194]: do_syncrep2: rid=105 LDAP_RES_INTERMEDIATE - SYNC_ID_SET
May 27 13:41:36 v06 slapd[27194]: syncrepl_message_to_entry: rid=105 DN: cn=config, UUID: 85353d82-cf3a-103f-9d14-755ae58f92b6
May 27 13:41:36 v06 slapd[27194]: syncrepl_entry: rid=105 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) csn=(none) tid 0x7f6b1b7fe6c0
…
May 27 13:43:33 v06 slapd[27194]: do_syncrepl: rid=105 rc -1 retrying (4 retries left)
May 27 13:43:34 v06 slapd[27194]: conn=1008 op=2 syncprov_op_search: got a persistent search with a cookie=rid=106,sid=005,csn=20130708113956.028116Z#000000#000#000000;20240708090039.0>
May 27 13:43:34 v06 slapd[27194]: conn=1008 op=2 syncprov_findbase: searching
May 27 13:43:34 v06 slapd[27194]: conn=1008 op=2 syncprov_op_search: registered persistent search
May 27 13:43:34 v06 slapd[27194]: conn=1008 op=2 syncprov_sendinfo: refreshDelete cookie=
May 27 13:43:34 v06 slapd[27194]: conn=1008 op=2 syncprov_search_response: detaching op
May 27 13:43:34 v06 slapd[27194]: conn=1009 op=2 syncprov_op_search: got a persistent search with a cookie=rid=116,sid=005,csn=20130719093756.074776Z#000000#000#000000;20250217105250.3>
May 27 13:43:34 v06 slapd[27194]: conn=1009 op=2 syncprov_findbase: searching
May 27 13:43:34 v06 slapd[27194]: conn=1009 op=2 syncprov_op_search: registered persistent search
May 27 13:43:34 v06 slapd[27194]: conn=1009 op=2 syncprov_op_search: no change, skipping log replay
May 27 13:43:34 v06 slapd[27194]: conn=1009 op=2 syncprov_op_search: nothing changed, finishing up initial search early
May 27 13:43:34 v06 slapd[27194]: conn=1009 op=2 syncprov_sendinfo: refreshDelete cookie=
May 27 13:43:34 v06 slapd[27194]: conn=1009 op=2 syncprov_search_response: detaching op
May 27 13:43:34 v06 slapd[27194]: conn=1011 op=2 syncprov_op_search: got a persistent search with a cookie=rid=106,sid=005,csn=20130708113956.028116Z#000000#000#000000;20240708090039.0>
May 27 13:43:34 v06 slapd[27194]: conn=1011 op=2 syncprov_findbase: searching
May 27 13:43:34 v06 slapd[27194]: conn=1011 op=2 syncprov_op_search: registered persistent search
May 27 13:43:34 v06 slapd[27194]: conn=1011 op=2 syncprov_sendinfo: refreshDelete cookie=
May 27 13:43:34 v06 slapd[27194]: conn=1011 op=2 syncprov_search_response: detaching op
May 27 13:43:34 v06 slapd[27194]: conn=1010 op=2 syncprov_op_search: got a persistent search with a cookie=rid=116,sid=005,csn=20130719093756.074776Z#000000#000#000000;20250217105250.3>
May 27 13:43:34 v06 slapd[27194]: conn=1010 op=2 syncprov_findbase: searching
May 27 13:43:34 v06 slapd[27194]: conn=1010 op=2 syncprov_op_search: registered persistent search
May 27 13:43:34 v06 slapd[27194]: conn=1010 op=2 syncprov_op_search: no change, skipping log replay
May 27 13:43:34 v06 slapd[27194]: conn=1010 op=2 syncprov_op_search: nothing changed, finishing up initial search early
May 27 13:43:34 v06 slapd[27194]: conn=1010 op=2 syncprov_sendinfo: refreshDelete cookie=
May 27 13:43:34 v06 slapd[27194]: conn=1010 op=2 syncprov_search_response: detaching op
May 27 13:43:34 v06 slapd[27194]: conn=1012 op=2 syncprov_op_search: got a persistent search with a cookie=rid=116,sid=005,csn=20130719093756.074776Z#000000#000#000000;20250217105250.3>
May 27 13:43:34 v06 slapd[27194]: conn=1012 op=2 syncprov_findbase: searching
May 27 13:43:34 v06 slapd[27194]: conn=1012 op=2 syncprov_op_search: registered persistent search
May 27 13:43:34 v06 slapd[27194]: conn=1012 op=2 syncprov_op_search: no change, skipping log replay
May 27 13:43:34 v06 slapd[27194]: conn=1012 op=2 syncprov_op_search: nothing changed, finishing up initial search early
May 27 13:43:34 v06 slapd[27194]: conn=1012 op=2 syncprov_sendinfo: refreshDelete cookie=
May 27 13:43:34 v06 slapd[27194]: conn=1012 op=2 syncprov_search_response: detaching op
May 27 13:43:34 v06 slapd[27194]: conn=1013 op=2 syncprov_op_search: got a persistent search with a cookie=rid=106,sid=005,csn=20130708113956.028116Z#000000#000#000000;20240708090039.0>
May 27 13:43:34 v06 slapd[27194]: conn=1013 op=2 syncprov_findbase: searching
May 27 13:43:34 v06 slapd[27194]: conn=1013 op=2 syncprov_op_search: registered persistent search
May 27 13:43:34 v06 slapd[27194]: conn=1013 op=2 syncprov_sendinfo: refreshDelete cookie=
May 27 13:43:34 v06 slapd[27194]: conn=1013 op=2 syncprov_search_response: detaching op
May 27 13:43:34 v06 slapd[27194]: conn=1014 op=2 syncprov_op_search: got a persistent search with a cookie=rid=106,sid=005,csn=20130708113956.028116Z#000000#000#000000;20240708090039.0>
May 27 13:43:34 v06 slapd[27194]: conn=1014 op=2 syncprov_findbase: searching
May 27 13:43:34 v06 slapd[27194]: conn=1014 op=2 syncprov_op_search: registered persistent search
May 27 13:43:34 v06 systemd[1]: Started Process Core Dump (PID 27241/UID 0).
May 27 13:43:35 v06 systemd-coredump[27242]: Process 27194 (slapd) of user 76 dumped core.
Stack trace of thread 27199:
#0  0x00007f6b34ca8dfc __pthread_kill_implementation (libc.so.6 + 0xa8dfc)
#1  0x00007f6b34c57842 raise (libc.so.6 + 0x57842)
#2  0x00007f6b34c3f5cf abort (libc.so.6 + 0x3f5cf)
#3  0x00007f6b34c3f4e7 __assert_fail_base.cold (libc.so.6 + 0x3f4e7)
#4  0x00007f6b34c4fb32 __assert_fail (libc.so.6 + 0x4fb32)
#5  0x00007f6b34787258 n/a (syncprov.so + 0xc258)
#6  0x000055765d7e04f3 overlay_op_walk (slapd + 0xb74f3)
#7  0x000055765d7e06be n/a (slapd + 0xb76be)
#8  0x000055765d76ee54 fe_op_search (slapd + 0x45e54)
#9  0x000055765d76e726 do_search (slapd + 0x45726)
#10 0x000055765d76c18f n/a (slapd + 0x4318f)
#11 0x000055765d76c98d n/a (slapd + 0x4398d)
#12 0x00007f6b34ff7da0 n/a (libldap-2.5.releng.so.0 + 0x48da0)
#13 0x00007f6b34ca6f6c start_thread (libc.so.6 + 0xa6f6c)
#14 0x00007f6b34d2e338 __clone3 (libc.so.6 + 0x12e338)
Stack trace of thread 27194:
#0 0x00007f6b34ca3c1e __futex_abstimed_wait_common (libc.so.6 + 0xa3c1e)
#1 0x00007f6b34ca8b83 __pthread_clockjoin_ex (libc.so.6 + 0xa8b83)
#2 0x000055765d7691ca slapd_daemon (slapd + 0x401ca)
#3 0x000055765d74c88e main (slapd + 0x2388e)
#4 0x00007f6b34c40e6c __libc_start_call_main (libc.so.6 + 0x40e6c)
#5 0x00007f6b34c40f35 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x40f35)
#6 0x000055765d74c90a _start (slapd + 0x2390a)

Stack trace of thread 27204:
#0 0x00007f6b34ca3c1e __futex_abstimed_wait_common (libc.so.6 + 0xa3c1e)
#1 0x00007f6b34ca6548 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0xa6548)
#2 0x00007f6b34ff7e40 n/a (libldap-2.5.releng.so.0 + 0x48e40)
#3 0x00007f6b34ca6f6c start_thread (libc.so.6 + 0xa6f6c)
#4 0x00007f6b34d2e338 __clone3 (libc.so.6 + 0x12e338)

Stack trace of thread 27198:
#0 0x00007f6b34ca3c1e __futex_abstimed_wait_common (libc.so.6 + 0xa3c1e)
#1 0x00007f6b34ca6548 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0xa6548)
#2 0x00007f6b34ff7e40 n/a (libldap-2.5.releng.so.0 + 0x48e40)
#3 0x00007f6b34ca6f6c start_thread (libc.so.6 + 0xa6f6c)
#4 0x00007f6b34d2e338 __clone3 (libc.so.6 + 0x12e338)

Stack trace of thread 27197:
#0 0x00007f6b34ca3c1e __futex_abstimed_wait_common (libc.so.6 + 0xa3c1e)
#1 0x00007f6b34ca6548 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0xa6548)
#2 0x00007f6b34ff7e40 n/a (libldap-2.5.releng.so.0 + 0x48e40)
#3 0x00007f6b34ca6f6c start_thread (libc.so.6 + 0xa6f6c)
#4 0x00007f6b34d2e338 __clone3 (libc.so.6 + 0x12e338)

Stack trace of thread 27202:
#0 0x00007f6b34ca3c1e __futex_abstimed_wait_common (libc.so.6 + 0xa3c1e)
#1 0x00007f6b34ca6548 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0xa6548)
#2 0x00007f6b34ff7e40 n/a (libldap-2.5.releng.so.0 + 0x48e40)
#3 0x00007f6b34ca6f6c start_thread (libc.so.6 + 0xa6f6c)
#4 0x00007f6b34d2e338 __clone3 (libc.so.6 + 0x12e338)

Stack trace of thread 27200:
#0 0x00007f6b34ca3c1e __futex_abstimed_wait_common (libc.so.6 + 0xa3c1e)
#1 0x00007f6b34ca6548 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0xa6548)
#2 0x00007f6b34ff7e40 n/a (libldap-2.5.releng.so.0 + 0x48e40)
#3 0x00007f6b34ca6f6c start_thread (libc.so.6 + 0xa6f6c)
#4 0x00007f6b34d2e338 __clone3 (libc.so.6 + 0x12e338)

Stack trace of thread 27203:
#0 0x00007f6b34ca3c1e __futex_abstimed_wait_common (libc.so.6 + 0xa3c1e)
#1 0x00007f6b34ca6548 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0xa6548)
#2 0x00007f6b34ff7e40 n/a (libldap-2.5.releng.so.0 + 0x48e40)
#3 0x00007f6b34ca6f6c start_thread (libc.so.6 + 0xa6f6c)
#4 0x00007f6b34d2e338 __clone3 (libc.so.6 + 0x12e338)

Stack trace of thread 27196:
#0 0x00007f6b34d2e796 epoll_wait (libc.so.6 + 0x12e796)
#1 0x000055765d76607b n/a (slapd + 0x3d07b)
#2 0x00007f6b34ca6f6c start_thread (libc.so.6 + 0xa6f6c)
#3 0x00007f6b34d2e338 __clone3 (libc.so.6 + 0x12e338)

ELF object binary architecture: AMD x86-64

May 27 13:43:35 v06 systemd[1]: systemd-coredump@5-27241-0.service: Deactivated successfully.
Kind regards, Ulrich Windl
On Tue, May 27, 2025 at 12:59:42PM +0000, Windl, Ulrich wrote:
Hi!
When upgrading from OpenLDAP 2.4 (SLES12) to OpenLDAP 2.5 (SLES15) I gave delta-syncrepl a try. That was a hard way in several aspects. Meanwhile I think I understand most of the details (docs could be much better IMHO).
Hi Ulrich, I am sorry that you have not had a great experience. Please suggest which parts of the documentation (manpages, admin guide) you feel should be adjusted.
Where delta-syncrepl has big problems is when sync has been set up, but one database is reloaded and some UUIDs are newly created for entries that exist on the other server(s). Somehow slapd detects that problem and claims that a “content sync” is required, but after some time it seems to start a refresh anyway.
It looks like your ACLs are not as documented, or you have chosen to reload a database *not* from a slapcat, preserving some information (entryCSNs, ...) and not preserving other information (entryUUIDs)? The required ACLs have to give the replication identity *unrestricted* read access to both the replicated DB and its accesslog; anything else will lead to deltasync replication failing in various not always easy to spot ways.
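As a sketch (the replicator DN and suffixes are placeholders, not taken from your setup), "unrestricted read access" typically means something like this in slapd.conf, on both the replicated database and the accesslog database:

```
# Placeholder replication identity; lift the default size/time limits,
# which can otherwise silently truncate a refresh
limits dn.exact="cn=replicator,dc=example,dc=com"
    time.soft=unlimited time.hard=unlimited
    size.soft=unlimited size.hard=unlimited

# Give the replication identity read access to everything,
# then let normal ACL processing continue for other identities
access to *
    by dn.exact="cn=replicator,dc=example,dc=com" read
    by * break
```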
Replication cannot figure this out for you because its own state is now inconsistent. Either start from scratch or use a slapcat+slapadd for the database. If you have actually done what I'm suggesting here, please describe how you got into this situation because that would be a bug.
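A reload that preserves replication state could look like this (the database number and path are placeholders; slapd must be stopped on the server being reloaded):

```shell
# Export the database with operational attributes
# (entryUUID, entryCSN, contextCSN) intact
slapcat -n 2 -l /tmp/db.ldif

# On the target server, with slapd stopped and the old database
# files removed, reload; -w rewrites the syncrepl context info
slapadd -n 2 -w -l /tmp/db.ldif
```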
When I did the content load on the other server, slapd quit with a core dump. Unfortunately I had quite a lot of core dumps during my testing. So I feel delta-syncrepl is not as solid as it should be (in the version provided with SLES15 SP6).
Yes, replication relies on keeping its own state that you interfere with at your own peril, potentially triggering temporary or even permanent desyncs. However if you encounter a crash, I will ask you again that you log a bug with steps to reproduce and/or a full backtrace with the necessary symbols available. And any logs you can provide, if you need to redact confidential information that is fine. We cannot fix bugs we are not aware of except by accident.
May 27 13:43:35 v06 systemd-coredump[27242]: Process 27194 (slapd) of user 76 dumped core.
Stack trace of thread 27199:
#0  0x00007f6b34ca8dfc __pthread_kill_implementation (libc.so.6 + 0xa8dfc)
#1  0x00007f6b34c57842 raise (libc.so.6 + 0x57842)
#2  0x00007f6b34c3f5cf abort (libc.so.6 + 0x3f5cf)
#3  0x00007f6b34c3f4e7 __assert_fail_base.cold (libc.so.6 + 0x3f4e7)
#4  0x00007f6b34c4fb32 __assert_fail (libc.so.6 + 0x4fb32)
#5  0x00007f6b34787258 n/a (syncprov.so + 0xc258)
#6  0x000055765d7e04f3 overlay_op_walk (slapd + 0xb74f3)
#7  0x000055765d7e06be n/a (slapd + 0xb76be)
#8  0x000055765d76ee54 fe_op_search (slapd + 0x45e54)
#9  0x000055765d76e726 do_search (slapd + 0x45726)
#10 0x000055765d76c18f n/a (slapd + 0x4318f)
#11 0x000055765d76c98d n/a (slapd + 0x4398d)
#12 0x00007f6b34ff7da0 n/a (libldap-2.5.releng.so.0 + 0x48da0)
#13 0x00007f6b34ca6f6c start_thread (libc.so.6 + 0xa6f6c)
#14 0x00007f6b34d2e338 __clone3 (libc.so.6 + 0x12e338)
This backtrace is not very useful, I suggest you not strip the binaries or make sure you have the relevant debuginfo packages in place and have systemd-coredump store the core file[0] so you can actually examine it after the fact with gdb.
[0]. https://systemd.io/COREDUMP/
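Roughly (the SLES debuginfo package name here is an assumption on my part):

```shell
# List cores captured by systemd-coredump
coredumpctl list slapd

# On SLES, install matching debug symbols (package name may differ)
zypper install openldap2-debuginfo

# Open the most recent slapd core in gdb; then, inside gdb:
#   (gdb) thread apply all bt full
coredumpctl debug slapd
```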
Thanks,
See inline comments (as good as Outlook allows)
Kind regards, Ulrich Windl
-----Original Message-----
From: Ondřej Kuzník <ondra@mistotebe.net>
Sent: Wednesday, May 28, 2025 11:32 AM
To: Windl, Ulrich <u.windl@ukr.de>
Cc: openldap-technical@openldap.org
Subject: [EXT] Re: delta-syncrepl experience with OpenLDAP 2.5 (from SLES15)
On Tue, May 27, 2025 at 12:59:42PM +0000, Windl, Ulrich wrote:
Hi!
When upgrading from OpenLDAP 2.4 (SLES12) to OpenLDAP 2.5 (SLES15) I gave delta-syncrepl a try. That was a hard way in several aspects. Meanwhile I think I understand most of the details (docs could be much better IMHO).
Hi Ulrich, I am sorry that you have not had a great experience. Please suggest which parts of the documentation (manpages, admin guide) you feel should be adjusted.
[Windl, Ulrich] In slapd-config(5): The olcSyncrepl description is quite long and lists all the keywords. I think there should first be a description, or a reference to one, explaining the replication variants (and details of how those work).
In the Admin Guide (18.2.1. Delta-syncrepl replication) the disadvantages of LDAP Sync and the advantages of delta-syncrepl are explained, but the relationship between the local database and the corresponding changelog is not explained well IMHO. For example, "The replication consumer checks the changelog for the changes it needs": so the consumer queries the local changelog to see which changes to request from the provider (which, in turn, consults its local changelog to provide the changes)? If it works that way, can you use delta-syncrepl one-way without having a changelog on the consumer? I always felt it's important to keep the provider's database and changelog in sync for delta-syncrepl to work (well, actually I thought the provider's changelog should sync from a refreshed database automatically), but it seems I have to delete the consumer's changelog database as well after reloading the provider's content.
Maybe the admin guide should lose a few words on this...
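For reference, a minimal delta-syncrepl consumer stanza as I understand the docs (slapd.conf style; the DNs, URI and credentials are placeholders). The delta part is `logbase`/`logfilter`/`syncdata=accesslog` pointing at the provider's accesslog:

```
syncrepl rid=105
    provider=ldap://provider.example.com
    bindmethod=simple
    binddn="cn=replicator,dc=example,dc=com"
    credentials=secret
    searchbase="dc=example,dc=com"
    type=refreshAndPersist
    retry="60 +"
    logbase="cn=accesslog"
    logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
    syncdata=accesslog
```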
Then "18.3.2.1. Delta-syncrepl Provider configuration
Setting up delta-syncrepl requires configuration changes on both the provider and replica servers:":
I think it would be better to summarize the changes needed before showing the example; otherwise the user has to guess from the example. Likewise for "18.3.2.2. Delta-syncrepl Consumer configuration". Also for "Note: An accesslog database is unique to a given provider.": It's not quite clear whether one changelog can be used for multiple databases on the provider (or whether each database needs a separate changelog). (The same applies to the scope of RIDs: may different databases on a consumer have the same RIDs, or must they be different?)
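For example, such a summary could say: the provider needs a separate accesslog database carrying the syncprov overlay, and the replicated database needs the accesslog overlay (plus syncprov) pointing at it. A hedged sketch, with placeholder paths and suffixes:

```
# Accesslog (changelog) database -- placeholder path/suffix
database mdb
suffix "cn=accesslog"
directory /var/lib/ldap/accesslog
rootdn "cn=accesslog"
index default eq
index entryCSN,objectClass,reqEnd,reqResult,reqStart

overlay syncprov
syncprov-nopresent TRUE
syncprov-reloadhint TRUE

# On the main (replicated) database:
overlay syncprov
syncprov-checkpoint 100 10

overlay accesslog
logdb "cn=accesslog"
logops writes
logsuccess TRUE
# scan once a day, purge log entries older than 7 days
logpurge 07+00:00 01+00:00
```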
Where delta-syncrepl has big problems is when sync has been set up, but one database is reloaded and some UUIDs are newly created for entries that exist on the other server(s). Somehow slapd detects that problem and claims that a “content sync” is required, but after some time it seems to start a refresh anyway.
It looks like your ACLs are not as documented, or you have chosen to reload a database *not* from a slapcat, preserving some information (entryCSNs, ...) and not preserving other information (entryUUIDs)? The required
[Windl, Ulrich] Well I think the problem is related to schema updates that I implemented like this in LDIF:
dn: cn=schema,cn=config
objectClass: olcSchemaConfig
cn: schema
structuralObjectClass: olcSchemaConfig
entryUUID: db3f59a6-7c0e-1032-81c5-d54356bd918f
creatorsName: cn=config
createTimestamp: 20130708113956Z
entryCSN: 20250313000000.000000Z#000000#005#000000
modifiersName: cn=config
modifyTimestamp: 20250313000000Z
include: file:///etc/openldap/schema/core.ldif
include: file:///etc/openldap/schema/cosine.ldif
include: file:///etc/openldap/schema/inetorgperson.ldif
include: file:///etc/openldap/schema/rfc2307bis.ldif
include: file:///etc/openldap/schema/yast.ldif
include: file:///etc/openldap/schema/sudo.ldif
dn: olcDatabase={-1}frontend,cn=config
...
On import a new UUID and CSN will be created for each schema entry, but the consumer has a different UUID/CSN for each. I don't really know how to handle that; maybe the schema LDIFs should actually contain a UUID and a CSN. I was thinking about generating those from either the file's modification time or from the content revision (which is even harder to get).
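A sketch of the mtime idea (a hypothetical helper, nothing slapd provides; the SID has to be passed in, and slapd normally generates CSNs itself):

```python
import os
import time

def csn_from_mtime(path, sid=1):
    """Derive a deterministic entryCSN-style value from a file's mtime.

    Produces a stamp in the textual CSN format
    YYYYmmddHHMMSS.ffffffZ#<change-count>#<sid>#<mod-number>,
    with change-count/mod-number fixed at zero.
    """
    t = os.stat(path).st_mtime
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(t))
    frac = int(round((t % 1) * 1_000_000))
    # the SID field is three hex digits in a CSN
    return "%s.%06dZ#000000#%03x#000000" % (stamp, frac, sid)
```

Two imports with the same mtime would then agree on the schema entry's CSN instead of each generating a fresh one.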
ACLs have to give the replication identity *unrestricted* read access to both the replicated DB and its accesslog, anything else will lead to deltasync replication failing in various not always easy to spot ways.
[Windl, Ulrich]
I think I did that. See above.
Replication cannot figure this out for you because its own state is now inconsistent. Either start from scratch or use a slapcat+slapadd for the database. If you have actually done what I'm suggesting here, please describe how you got into this situation because that would be a bug.
When I did the content load on the other server, slapd quit with a core dump. Unfortunately I had quite a lot of core dumps during my testing. So I feel delta-syncrepl is not as solid as it should be (in the version provided with SLES15 SP6).
Yes, replication relies on keeping its own state that you interfere with at your own peril, potentially triggering temporary or even permanent desyncs. However if you encounter a crash, I will ask you again that you log a bug with steps to reproduce and/or a full backtrace with the necessary symbols available. And any logs you can provide, if you need to redact confidential information that is fine. We cannot fix bugs we are not aware of except by accident.
[Windl, Ulrich]
I understand that, but unfortunately I'm not using one of your official versions, so SUSE has to deal with their own patch sets, I'm afraid.
May 27 13:43:35 v06 systemd-coredump[27242]: Process 27194 (slapd) of user 76 dumped core.
Stack trace of thread 27199:
#0  0x00007f6b34ca8dfc __pthread_kill_implementation (libc.so.6 + 0xa8dfc)
#1  0x00007f6b34c57842 raise (libc.so.6 + 0x57842)
#2  0x00007f6b34c3f5cf abort (libc.so.6 + 0x3f5cf)
#3  0x00007f6b34c3f4e7 __assert_fail_base.cold (libc.so.6 + 0x3f4e7)
#4  0x00007f6b34c4fb32 __assert_fail (libc.so.6 + 0x4fb32)
#5  0x00007f6b34787258 n/a (syncprov.so + 0xc258)
#6  0x000055765d7e04f3 overlay_op_walk (slapd + 0xb74f3)
#7  0x000055765d7e06be n/a (slapd + 0xb76be)
#8  0x000055765d76ee54 fe_op_search (slapd + 0x45e54)
#9  0x000055765d76e726 do_search (slapd + 0x45726)
#10 0x000055765d76c18f n/a (slapd + 0x4318f)
#11 0x000055765d76c98d n/a (slapd + 0x4398d)
#12 0x00007f6b34ff7da0 n/a (libldap-2.5.releng.so.0 + 0x48da0)
#13 0x00007f6b34ca6f6c start_thread (libc.so.6 + 0xa6f6c)
#14 0x00007f6b34d2e338 __clone3 (libc.so.6 + 0x12e338)
This backtrace is not very useful, I suggest you not strip the binaries or make sure you have the relevant debuginfo packages in place and have systemd-coredump store the core file[0] so you can actually examine it after the fact with gdb.
[Windl, Ulrich]
In SLES debug information is shipped in separate "debuginfo" packages, and I did not take the time to find out how to actually use them (well, even some supporters at SUSE seem not to know; in my understanding they could use the core dump together with the binary and debuginfo to get more useful information from the dump, but that's a different topic).
[0]. https://systemd.io/COREDUMP/
Thanks,
--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
Hi!
An update on my experience with delta-syncrepl on OpenLDAP 2.5 as provided by SLES15 SP6: In one configuration, I decided to test replication by stopping the server, clearing the changelog and the DIT, then starting it again. Unfortunately the providing server core dumped then. After restarting, it continued. So I checked whether both servers were in sync, and they were not. So I cleared the changelog for the DIT on the crashed server (while it was stopped), then started it again. Expecting sync to happen, I saw nothing. Then I realized that the other server had core dumped. 8-( Somewhere in between these experiments an accesslog ran out of space, so I deleted that, too.
With the situation of two differing servers after a core dump, I'm wondering what happens when an MDB changelog becomes full: obviously new entries cannot be added (while the sync from the remote virtually continues), but can an existing record be updated? Specifically I wonder about "updating mincsn": what happens to sync if the actual changelog entry could not be written, but mincsn is being updated? Will that cause inconsistency in the changelog?
Sorry, I still don't understand all the details of delta-syncrepl. As it is now, I don't feel it's ready for production, specifically if an (outdated) consumer can cause the provider to dump core (as it seems).
I know that I should rather try version 2.6, but SLES support is only available for "their version".
Kind regards, Ulrich Windl