Hi Ondřej,

Thanks for getting back to me. I have the logs from a previous replication stall, and I'll capture them again the next time it happens. I checked those logs and don't see any abandoned connections.

aaa-prod-aws-12:1636
# requesting: contextCSN
contextCSN: 20250102015911.702871Z#000000#000#000000

All the relevant logs and info:

dn: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor
structuralObjectClass: olmSyncReplInstance
creatorsName:
modifiersName:
createTimestamp: 20241209130653Z
modifyTimestamp: 20241209130653Z
olmSRProviderURIList: ldaps://aaa-master-1.uis.georgetown.edu:636/
olmSRConnection: IP=172.20.86.12:49880
olmSRSyncPhase: Persist
olmSRNextConnect: 00000101000000Z
olmSRLastConnect: 20241229203510Z
olmSRLastContact: 20250102015934Z
olmSRLastCookieRcvd: rid=152,csn=20250102015911.702871Z#000000#000#000000
olmSRLastCookieSent: rid=152,csn=20241229202835.459483Z#000000#000#000000
entryDN: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor
subschemaSubentry: cn=Subschema
hasSubordinates: FALSE

Consumer:
netstat -an | grep 49880
tcp        0      0 172.20.86.12:49880      172.17.21.52:636        ESTABLISHED

Master:
netstat -an | grep 172.20.86.12
tcp        0      0 172.17.21.52:636        172.20.86.12:49880      ESTABLISHED

Master logs:
Jan  1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000
Jan  1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000

Nothing about rid=152 is logged after the above.
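
For reference, this is roughly how I pulled the lines above out of the provider log: find the first line carrying "cookie=rid=152", take its conn/op pair, then keep only the lines with that pair. It is a minimal sketch, not a finished tool. The function names (find_session, session_lines) are my own, and the sample list just reuses the two master log lines quoted above. It assumes the "conn=NNN op=MMM" text appears on every line of interest, so connection-level lines without an op= would be missed.

```python
import re

# Matches the "conn=NNN op=MMM" pair as it appears in the slapd lines above.
CONN_OP = re.compile(r"\bconn=(\d+) op=(\d+)\b")


def find_session(lines, rid):
    """Return (conn, op) from the first line mentioning cookie=rid=<rid>, or None."""
    marker = f"cookie=rid={rid},"
    for line in lines:
        if marker in line:
            m = CONN_OP.search(line)
            if m:
                return int(m.group(1)), int(m.group(2))
    return None


def session_lines(lines, conn, op):
    """Keep only the lines logged for that exact conn/op pair."""
    wanted = (conn, op)
    out = []
    for line in lines:
        m = CONN_OP.search(line)
        if m and (int(m.group(1)), int(m.group(2))) == wanted:
            out.append(line)
    return out


# Sample input: the two provider log lines quoted in this message.
sample = [
    "Jan  1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 "
    "syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000",
    "Jan  1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 "
    "syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000",
]

session = find_session(sample, 152)
if session is not None:
    for line in session_lines(sample, *session):
        print(line)
```

On the real log I would pass the open file object instead of the sample list. Both functions only iterate over lines, but the file has to be reopened (or rewound) between the two passes.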

Consumer logs:
Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: do_syncrep2: rid=152 cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: syncrepl_entry: rid=152 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20250102015911.702871Z#000000#000#000000 tid 0x7f7a753fc640
Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000
Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000
Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000
Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000


Nothing about replication is logged after the above.

From the last coredump:

Thread 1 (Thread 0x7f85243fa640 (LWP 192314)):
#0  connection_abandon (c=0x7f9eb4ad0078) at connection.c:714
#1  0x00000000004460d5 in connection_closing (c=0x7f9eb4ad0078, why=0x5db380 <conn_lost_str> "connection lost") at connection.c:785
#2  0x0000000000447d18 in connection_read (s=31, cri=0x7f85243f99a0) at connection.c:1453
#3  0x000000000044741b in connection_read_thread (ctx=0x7f85243f99f0, argv=0x1f) at connection.c:1260
#4  0x00007f9ecd406bed in ldap_int_thread_pool_wrapper (xpool=0xac8080) at tpool.c:1059
#5  0x00007f9ecca89c02 in start_thread () from /lib64/libc.so.6
#6  0x00007f9eccb0ec40 in clone3 () from /lib64/libc.so.6
No core file now.

Thanks,
Suresh

On Tue, Mar 4, 2025 at 6:12 AM Ondřej Kuzník <ondra@mistotebe.net> wrote:
On Mon, Jan 13, 2025 at 10:42:58AM -0500, Suresh Veliveli wrote:
> Hi Ondřej,
>
> Attached is the file from the last crash for "thread apply all bt full". I
> built it from the src (openldap.org). The installation is prefixed to
> /var/services/openldap directory. I do have "stats sync" log level enabled.
> Our logs are huge, I could get the necessary info if you can tell what I
> need to look for.

Hi Suresh,
as I mentioned, you want to see what the provider was doing with the
session and the decisions it took along the way. To see that, you want
to find where the session starts (where you find this "cookie=rid=..."
message) and *then* use the "conn=xxx op=yyy" you find in this message
to isolate the messages that correlate with it. That's the first thing
you'll need to track down what eventually happened to the session.

If it's related to the crash in any way, it might also show us if
something went wrong if we're lucky.

Also just out of interest, are there any Abandon/Cancel requests in the
logs?

Thanks,

--
Ondřej Kuzník
Senior Software Engineer
Symas Corporation                       http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP


--
Suresh Veliveli
Sr. UNIX Systems Engineer
Georgetown University
University Information Services | Security Infrastructure and Policy-Identity and Collaboration
202-262-6676 (cell) | 202-687-3108 (work)