This is another instance where replication stops.
Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060050.594465Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=xxxx,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=xxxxx,ou=People,dc=georgetown,dc=edu (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_message_to_entry: rid=143 DN: uid=xxxxx,ou=People,dc=georgetown,dc=edu, UUID: db41232e-527d-103f-995c-4344349468b5 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060104.603554Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=yyyyy,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=yyyy,ou=People,dc=georgetown,dc=edu (0)
No "syncrep_" messages were logged after the above.
Regards, Suresh
On Mon, Dec 23, 2024 at 2:42 PM Quanah Gibson-Mount quanah@fast-mail.org wrote:
--On Friday, December 20, 2024 5:46 PM -0500 Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
I do have these but no syncrepl_sync.
Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_message_to_entry: rid=129 DN: uid=xxxx,ou=xxxx,dc=georgetown,dc=edu, UUID: 37387976-3ae0-103f-94fc-3b4be3361dc7 Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_entry: rid=129 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241212135921.652395Z#000000#000#000000 tid 0x7f671b7fe640
Yes sorry, that was a typo on my end.
"syncrepl_" sync messages. Specifically you want to see the ones that have "be_" in them, but not "be_search". I.e., be_add, be_modify, etc. They show the result of the operation, something like:
syncrepl_entry: rid=100 be_modify <DN> (<result code>)
--Quanah
Any thoughts?
Thanks, suresh
On Tue, Dec 24, 2024 at 10:41 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
This is another instance where replication stops.
Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060050.594465Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=xxxx,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=xxxxx,ou=People,dc=georgetown,dc=edu (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_message_to_entry: rid=143 DN: uid=xxxxx,ou=People,dc=georgetown,dc=edu, UUID: db41232e-527d-103f-995c-4344349468b5 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060104.603554Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=yyyyy,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=yyyy,ou=People,dc=georgetown,dc=edu (0)
No "syncrep_" messages were logged after the above.
Regards, Suresh
On Mon, Dec 23, 2024 at 2:42 PM Quanah Gibson-Mount quanah@fast-mail.org wrote:
--On Friday, December 20, 2024 5:46 PM -0500 Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
I do have these but no syncrepl_sync.
Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_message_to_entry: rid=129 DN: uid=xxxx,ou=xxxx,dc=georgetown,dc=edu, UUID: 37387976-3ae0-103f-94fc-3b4be3361dc7 Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_entry: rid=129 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241212135921.652395Z#000000#000#000000 tid 0x7f671b7fe640
Yes sorry, that was a typo on my end.
"syncrepl_" sync messages. Specifically you want to see the ones that have "be_" in them, but not "be_search". I.e., be_add, be_modify, etc. They show the result of the operation, something like:
syncrepl_entry: rid=100 be_modify <DN> (<result code>)
--Quanah
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
Another instance where replication is stuck and not recovering.
# requesting: contextCSN contextCSN: *20241229135907.725117Z#000000#000#000000* aaa-prod-aws-10:2636 # requesting: contextCSN contextCSN:* 20241228185913.665451Z#000000#000#000000*
*Log info:* Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: do_syncrep2: rid=650 cookie=rid=650,csn=20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: syncrepl_entry: rid=650 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241228185913.665451Z#000000#000#000000 tid 0x7f26ee5fd640 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000
Nothing gets logged about replication after the above.
Am I missing something?
Regards, Suresh
On Sat, Dec 28, 2024 at 10:05 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts?
Thanks, suresh
On Tue, Dec 24, 2024 at 10:41 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
This is another instance where replication stops.
Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060050.594465Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=xxxx,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=xxxxx,ou=People,dc=georgetown,dc=edu (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_message_to_entry: rid=143 DN: uid=xxxxx,ou=People,dc=georgetown,dc=edu, UUID: db41232e-527d-103f-995c-4344349468b5 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060104.603554Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=yyyyy,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=yyyy,ou=People,dc=georgetown,dc=edu (0)
No "syncrep_" messages were logged after the above.
Regards, Suresh
On Mon, Dec 23, 2024 at 2:42 PM Quanah Gibson-Mount quanah@fast-mail.org wrote:
--On Friday, December 20, 2024 5:46 PM -0500 Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
I do have these but no syncrepl_sync.
Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_message_to_entry: rid=129 DN: uid=xxxx,ou=xxxx,dc=georgetown,dc=edu, UUID: 37387976-3ae0-103f-94fc-3b4be3361dc7 Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_entry: rid=129 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241212135921.652395Z#000000#000#000000 tid 0x7f671b7fe640
Yes sorry, that was a typo on my end.
"syncrepl_" sync messages. Specifically you want to see the ones that have "be_" in them, but not "be_search". I.e., be_add, be_modify, etc. They show the result of the operation, something like:
syncrepl_entry: rid=100 be_modify <DN> (<result code>)
--Quanah
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
On Sun, Dec 29, 2024 at 01:28:42PM -0500, Suresh Veliveli wrote:
Another instance where replication is stuck and not recovering.
# requesting: contextCSN contextCSN: *20241229135907.725117Z#000000#000#000000* aaa-prod-aws-10:2636 # requesting: contextCSN contextCSN:* 20241228185913.665451Z#000000#000#000000*
*Log info:* Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: do_syncrep2: rid=650 cookie=rid=650,csn=20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: syncrepl_entry: rid=650 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241228185913.665451Z#000000#000#000000 tid 0x7f26ee5fd640 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000
Nothing gets logged about replication after the above.
Am I missing something?
Hi Suresh, anything in the provider logs around that time? All consumers messages will be tagged with a specific "conn=xxx op=yyy" which you can discover e.g. by looking for the cookie it sends at the beginning of the session.
Couple of other questions: - is the TCP connection alive as far as the OS is concerned (I see in the thread you've confirmed TCP keepalive is enabled, correct?) - could you post the cn=monitor info for the consumer? The objectclass to look for is olmSyncReplInstance
Thanks,
Another instance: Yes, TCP keepalive is enabled.
aaa-prod-aws-7:1636 # requesting: contextCSN *contextCSN: 20250101065905.147164Z#000000#000#000000*
aaa-prod-aws-7:2636 # requesting: contextCSN contextCSN: 20250102140005.217756Z#000000#000#000000
dn: cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor objectClass: olmSyncReplInstance cn: Consumer 147
*Consumer logs:*
Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: do_syncrep2: rid=147 cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000 Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: syncrepl_entry: rid=147 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20250101065905.147164Z#000000#000#000000 tid 0x7f8e4a5fd640 Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: slap_queue_csn: queueing 0x7f8e40152360 20250101065905.147164Z#000000#000#000000 Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: slap_graduate_commit_csn: removing 0x7f8e40152360 20250101065905.147164Z#000000#000#000000 Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: slap_queue_csn: queueing 0x7f8e4036c6c0 20250101065905.147164Z#000000#000#000000 Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: slap_graduate_commit_csn: removing 0x7f8e4036c6c0 20250101065905.147164Z#000000#000#000000
(Nothing after the above is logged regarding replication)
*Master:*
Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.124585Z#000000#000#000000 Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000
(Nothing after the above for rid=147)
*After restarting Consumer:*
Jan 2 09:29:54 aaa-prod-aws-7 slapd[2750000]: slap_get_csn: conn=-1 op=0 generated new csn=20250102142954.929120Z#000000#000#000000 manage= 0 Jan 2 09:29:55 aaa-prod-aws-7 slapd[2750000]: do_syncrep1: rid=147 starting refresh (sending cookie=rid=147,csn=20250101065905.147164Z#0000 00#000#000000) Jan 2 09:29:59 aaa-prod-aws-7 slapd[2750000]: syncrepl_entry: rid=147 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) csn=(none) tid 0x7f3288dfd640 Jan 2 09:29:59 aaa-prod-aws-7 slapd[2750000]: syncrepl_entry: rid=147 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) csn=(none) tid 0x7f3288dfd640
Thanks, Suresh
On Thu, Jan 2, 2025 at 6:34 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Sun, Dec 29, 2024 at 01:28:42PM -0500, Suresh Veliveli wrote:
Another instance where replication is stuck and not recovering.
# requesting: contextCSN contextCSN: *20241229135907.725117Z#000000#000#000000* aaa-prod-aws-10:2636 # requesting: contextCSN contextCSN:* 20241228185913.665451Z#000000#000#000000*
*Log info:* Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: do_syncrep2: rid=650 cookie=rid=650,csn=20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: syncrepl_entry: rid=650 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241228185913.665451Z#000000#000#000000 tid 0x7f26ee5fd640 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000
Nothing gets logged about replication after the above.
Am I missing something?
Hi Suresh, anything in the provider logs around that time? All consumers messages will be tagged with a specific "conn=xxx op=yyy" which you can discover e.g. by looking for the cookie it sends at the beginning of the session.
Couple of other questions:
- is the TCP connection alive as far as the OS is concerned (I see in the thread you've confirmed TCP keepalive is enabled, correct?)
- could you post the cn=monitor info for the consumer? The objectclass to look for is olmSyncReplInstance
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
On Thu, Jan 02, 2025 at 09:39:34AM -0500, Suresh Veliveli wrote:
Another instance: Yes, TCP keepalive is enabled.
So is the TCP connection still open from the point of both servers? See in netstat or ss.
aaa-prod-aws-7:1636 # requesting: contextCSN *contextCSN: 20250101065905.147164Z#000000#000#000000*
aaa-prod-aws-7:2636 # requesting: contextCSN contextCSN: 20250102140005.217756Z#000000#000#000000
dn: cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor objectClass: olmSyncReplInstance cn: Consumer 147
All the data in cn=monitor is contained in the operational attributes, as such, you'll have to request them either by name specifically, objectClass ('@olmSyncReplInstance') or blanket '+', maybe also '*' if you want regular attributes as well.
*Consumer logs:*
[...]
(Nothing after the above is logged regarding replication)
*Master:*
Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.124585Z#000000#000#000000 Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000 (Nothing after the above for rid=147)
This gives you the string to search for: searching for "conn=1034 op=1" here would give you the messages related to the replication session above. You'll see what happens on the provider and correlate that with what the consumer. For every new consumer session there will be a new "conn=xxx op=yyy" to search for.
Regards,
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
All the relevant logs and info:
dn: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor structuralObjectClass: olmSyncReplInstance creatorsName: modifiersName: createTimestamp: 20241209130653Z modifyTimestamp: 20241209130653Z olmSRProviderURIList: ldaps://aaa-master-1.uis.georgetown.edu:636/ olmSRConnection: IP=172.20.86.12:49880 olmSRSyncPhase: Persist olmSRNextConnect: 00000101000000Z olmSRLastConnect: 20241229203510Z olmSRLastContact: 20250102015934Z olmSRLastCookieRcvd: rid=152,csn=*20250102015911.702871Z#000000#000#000000* olmSRLastCookieSent: rid=152,csn=20241229202835.459483Z#000000#000#000000 entryDN: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor subschemaSubentry: cn=Subschema hasSubordinates: FALSE
*Consumer:* netstat -an | grep 49880 tcp 0 0 172.20.86.12:49880 172.17.21.52:636 ESTABLISHED
*Master:* netstat -an | grep 172.20.86.12 tcp 0 0 172.17.21.52:636 172.20.86.12:49880 ESTABLISHED
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
*Consumer logs:* Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: do_syncrep2: rid=152 cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: syncrepl_entry: rid=152 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20250102015911.702871Z#000000#000#000000 tid 0x7f7a753fc640 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000
Nothing about replication is logged after the above.
Thanks, Suresh
On Thu, Jan 2, 2025 at 10:08 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 09:39:34AM -0500, Suresh Veliveli wrote:
Another instance: Yes, TCP keepalive is enabled.
So is the TCP connection still open from the point of both servers? See in netstat or ss.
aaa-prod-aws-7:1636 # requesting: contextCSN *contextCSN: 20250101065905.147164Z#000000#000#000000*
aaa-prod-aws-7:2636 # requesting: contextCSN contextCSN: 20250102140005.217756Z#000000#000#000000
dn: cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor objectClass: olmSyncReplInstance cn: Consumer 147
All the data in cn=monitor is contained in the operational attributes, as such, you'll have to request them either by name specifically, objectClass ('@olmSyncReplInstance') or blanket '+', maybe also '*' if you want regular attributes as well.
*Consumer logs:*
[...]
(Nothing after the above is logged regarding replication)
*Master:*
Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.124585Z#000000#000#000000 Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000 (Nothing after the above for rid=147)
This gives you the string to search for: searching for "conn=1034 op=1" here would give you the messages related to the replication session above. You'll see what happens on the provider and correlate that with what the consumer. For every new consumer session there will be a new "conn=xxx op=yyy" to search for.
Regards,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
The OS is Rock9, AWS EC2 instance.
On Thu, Jan 2, 2025 at 10:32 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
All the relevant logs and info:
dn: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor structuralObjectClass: olmSyncReplInstance creatorsName: modifiersName: createTimestamp: 20241209130653Z modifyTimestamp: 20241209130653Z olmSRProviderURIList: ldaps://aaa-master-1.uis.georgetown.edu:636/ olmSRConnection: IP=172.20.86.12:49880 olmSRSyncPhase: Persist olmSRNextConnect: 00000101000000Z olmSRLastConnect: 20241229203510Z olmSRLastContact: 20250102015934Z olmSRLastCookieRcvd: rid=152,csn= *20250102015911.702871Z#000000#000#000000* olmSRLastCookieSent: rid=152,csn=20241229202835.459483Z#000000#000#000000 entryDN: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor subschemaSubentry: cn=Subschema hasSubordinates: FALSE
*Consumer:* netstat -an | grep 49880 tcp 0 0 172.20.86.12:49880 172.17.21.52:636 ESTABLISHED
*Master:* netstat -an | grep 172.20.86.12 tcp 0 0 172.17.21.52:636 172.20.86.12:49880 ESTABLISHED
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
*Consumer logs:* Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: do_syncrep2: rid=152 cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: syncrepl_entry: rid=152 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20250102015911.702871Z#000000#000#000000 tid 0x7f7a753fc640 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000
Nothing about replication is logged after the above.
Thanks, Suresh
On Thu, Jan 2, 2025 at 10:08 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 09:39:34AM -0500, Suresh Veliveli wrote:
Another instance: Yes, TCP keepalive is enabled.
So is the TCP connection still open from the point of both servers? See in netstat or ss.
aaa-prod-aws-7:1636 # requesting: contextCSN *contextCSN: 20250101065905.147164Z#000000#000#000000*
aaa-prod-aws-7:2636 # requesting: contextCSN contextCSN: 20250102140005.217756Z#000000#000#000000
dn: cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor objectClass: olmSyncReplInstance cn: Consumer 147
All the data in cn=monitor is contained in the operational attributes, as such, you'll have to request them either by name specifically, objectClass ('@olmSyncReplInstance') or blanket '+', maybe also '*' if you want regular attributes as well.
*Consumer logs:*
[...]
(Nothing after the above is logged regarding replication)
*Master:*
Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.124585Z#000000#000#000000 Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000 (Nothing after the above for rid=147)
This gives you the string to search for: searching for "conn=1034 op=1" here would give you the messages related to the replication session above. You'll see what happens on the provider and correlate that with what the consumer. For every new consumer session there will be a new "conn=xxx op=yyy" to search for.
Regards,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
--On Friday, January 3, 2025 7:52 AM -0500 Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
The OS is Rock9, AWS EC2 instance.
Question - Was this working on an earlier OpenLDAP version, and just started having issues with 2.6.9? I was re-reading your initial post and that wasn't clear.
Thanks!
--Quanah
We are still experiencing replication stalls. What additional information can I provide to identify the underlying issue?
Thanks, Suresh
On Fri, Jan 3, 2025 at 7:52 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
The OS is Rock9, AWS EC2 instance.
On Thu, Jan 2, 2025 at 10:32 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
All the relevant logs and info:
dn: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor structuralObjectClass: olmSyncReplInstance creatorsName: modifiersName: createTimestamp: 20241209130653Z modifyTimestamp: 20241209130653Z olmSRProviderURIList: ldaps://aaa-master-1.uis.georgetown.edu:636/ olmSRConnection: IP=172.20.86.12:49880 olmSRSyncPhase: Persist olmSRNextConnect: 00000101000000Z olmSRLastConnect: 20241229203510Z olmSRLastContact: 20250102015934Z olmSRLastCookieRcvd: rid=152,csn= *20250102015911.702871Z#000000#000#000000* olmSRLastCookieSent: rid=152,csn=20241229202835.459483Z#000000#000#000000 entryDN: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor subschemaSubentry: cn=Subschema hasSubordinates: FALSE
*Consumer:* netstat -an | grep 49880 tcp 0 0 172.20.86.12:49880 172.17.21.52:636 ESTABLISHED
*Master:* netstat -an | grep 172.20.86.12 tcp 0 0 172.17.21.52:636 172.20.86.12:49880 ESTABLISHED
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
*Consumer logs:* Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: do_syncrep2: rid=152 cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: syncrepl_entry: rid=152 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20250102015911.702871Z#000000#000#000000 tid 0x7f7a753fc640 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000
Nothing about replication is logged after the above.
Thanks, Suresh
On Thu, Jan 2, 2025 at 10:08 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 09:39:34AM -0500, Suresh Veliveli wrote:
Another instance: Yes, TCP keepalive is enabled.
So is the TCP connection still open from the point of both servers? See in netstat or ss.
aaa-prod-aws-7:1636 # requesting: contextCSN *contextCSN: 20250101065905.147164Z#000000#000#000000*
aaa-prod-aws-7:2636 # requesting: contextCSN contextCSN: 20250102140005.217756Z#000000#000#000000
dn: cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor objectClass: olmSyncReplInstance cn: Consumer 147
All the data in cn=monitor is contained in the operational attributes, as such, you'll have to request them either by name specifically, objectClass ('@olmSyncReplInstance') or blanket '+', maybe also '*' if you want regular attributes as well.
*Consumer logs:*
[...]
(Nothing after the above is logged regarding replication)
*Master:*
Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.124585Z#000000#000#000000 Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000 (Nothing after the above for rid=147)
This gives you the string to search for: searching for "conn=1034 op=1" here would give you the messages related to the replication session above. You'll see what happens on the provider and correlate that with what the consumer. For every new consumer session there will be a new "conn=xxx op=yyy" to search for.
Regards,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli <Suresh.Veliveli@georgetown.edu mailto:Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli <Suresh.Veliveli@georgetown.edu mailto:Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org http://openldap.org/). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník <ondra@mistotebe.net mailto:ondra@mistotebe.net> wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com http://www.symas.com/ Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
The host is an aws ec2 instance.
Regards, Suresh
On Thu, Jan 16, 2025 at 8:44 PM ogg@sr375.com wrote:
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
yes... check your interface stats...
On Jan 16, 2025, at 6:40 PM, Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
The host is an aws ec2 instance.
Regards, Suresh
On Thu, Jan 16, 2025 at 8:44 PM <ogg@sr375.com mailto:ogg@sr375.com> wrote:
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli <Suresh.Veliveli@georgetown.edu mailto:Suresh.Veliveli@georgetown.edu> wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli <Suresh.Veliveli@georgetown.edu mailto:Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli <Suresh.Veliveli@georgetown.edu mailto:Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org http://openldap.org/). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník <ondra@mistotebe.net mailto:ondra@mistotebe.net> wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote: > This is another instance where the replication stops. > > aaa-prod-aws-12:1636 > # requesting: contextCSN > contextCSN: *20250102015911.702871Z#000000#000#000000* > > *Master logs:* > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000 > > Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com http://www.symas.com/ Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
I will do it, but why is the master crashing when restarting a stalled replica?
Thanks, Suresh
On Thu, Jan 16, 2025 at 9:58 PM ogg@sr375.com wrote:
yes... check your interface stats...
On Jan 16, 2025, at 6:40 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
The host is an aws ec2 instance.
Regards, Suresh
On Thu, Jan 16, 2025 at 8:44 PM ogg@sr375.com wrote:
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
I have not experienced the master crashing but I have also experienced stalled replication regularly over the last 2 years. We’ve been on 2.46 and have keep alive enabled. We were hoping an upgrade would resolve the problem but it seems to exist on latest versions as well. I am also on aws ec2 instances.
On Thu, Jan 16, 2025 at 8:02 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
I will do it, but why is the master crashing when restarting a stalled replica?
Thanks, Suresh
On Thu, Jan 16, 2025 at 9:58 PM ogg@sr375.com wrote:
yes... check your interface stats...
On Jan 16, 2025, at 6:40 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
The host is an aws ec2 instance.
Regards, Suresh
On Thu, Jan 16, 2025 at 8:44 PM ogg@sr375.com wrote:
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote: > This is another instance where the replication stops. > > aaa-prod-aws-12:1636 > # requesting: contextCSN > contextCSN: *20250102015911.702871Z#000000#000#000000* > > *Master logs:* > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000 > > Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
--On Friday, January 17, 2025 10:51 AM -0700 Thomas Pritchard pritchardtw@gmail.com wrote:
I have not experienced the master crashing but I have also experienced stalled replication regularly over the last 2 years. We've been on 2.46 and have keep alive enabled. We were hoping an upgrade would resolve the problem but it seems to exist on latest versions as well. I am also on aws ec2 instances.
I've been running OpenLDAP in AWS on EC2 instances for the last 3 years at two different companies, and have never encountered this behavior.
--Quanah
My sysadmin says nothing is amiss with the interface.
Regards, Suresh
On Thu, Jan 16, 2025 at 10:02 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
I will do it, but why is the master crashing when restarting a stalled replica?
Thanks, Suresh
On Thu, Jan 16, 2025 at 9:58 PM ogg@sr375.com wrote:
yes... check your interface stats...
On Jan 16, 2025, at 6:40 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
The host is an aws ec2 instance.
Regards, Suresh
On Thu, Jan 16, 2025 at 8:44 PM ogg@sr375.com wrote:
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote: > This is another instance where the replication stops. > > aaa-prod-aws-12:1636 > # requesting: contextCSN > contextCSN: *20250102015911.702871Z#000000#000#000000* > > *Master logs:* > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000 > > Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
On Mon, Jan 13, 2025 at 10:42:58AM -0500, Suresh Veliveli wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Hi Suresh, as I mentioned, you want to see what the provider was doing with the session and the decisions it took along the way. To see that, you want to find where the session starts (where you find this "cookie=rid=..." message) and *then* use the "conn=xxx op=yyy" you find in this message to isolate the messages that correlate with it. That's the first thing you'll need to track down what eventually happened to the session.
If it's related to the crash in any way, it might also show us if something went wrong if we're lucky.
Also just out of interest, are there any Abandon/Cancel requests in the logs?
Thanks,
Hi Ondřej,
Thanks for getting back. I do have the logs from a previous replication stall. I'll capture the logs again next time it happens. I checked the logs. I don't see any abandoned connections.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: 20250102015911.702871Z#000000#000#000000
All the relevant logs and info:
dn: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor structuralObjectClass: olmSyncReplInstance creatorsName: modifiersName: createTimestamp: 20241209130653Z modifyTimestamp: 20241209130653Z olmSRProviderURIList: ldaps://aaa-master-1.uis.georgetown.edu:636/ olmSRConnection: IP=172.20.86.12:49880 olmSRSyncPhase: Persist olmSRNextConnect: 00000101000000Z olmSRLastConnect: 20241229203510Z olmSRLastContact: 20250102015934Z olmSRLastCookieRcvd: rid=152,csn=20250102015911.702871Z#000000#000#000000 olmSRLastCookieSent: rid=152,csn=20241229202835.459483Z#000000#000#000000 entryDN: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor subschemaSubentry: cn=Subschema hasSubordinates: FALSE
*Consumer:* netstat -an | grep 49880 tcp 0 0 172.20.86.12:49880 172.17.21.52:636 ESTABLISHED
*Master:* netstat -an | grep 172.20.86.12 tcp 0 0 172.17.21.52:636 172.20.86.12:49880 ESTABLISHED
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 *Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=**20250102015911.702871Z#000000#* *000#000000*
*Nothing about rid=152 is logged after the above.*
*Consumer logs:Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: do_syncrep2: rid=152 cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: syncrepl_entry: rid=152 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20250102015911.702871Z#000000#000#000000 tid 0x7f7a753fc640Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000*
*Nothing about replication is logged after the above.*
From the last coredump:
Thread 1 (Thread 0x7f85243fa640 (LWP 192314)): #0 connection_abandon (c=0x7f9eb4ad0078) at connection.c:714 #1 0x00000000004460d5 in connection_closing (c=0x7f9eb4ad0078, why=0x5db380 <conn_lost_str> "connection lost") at connection.c:785 #2 0x0000000000447d18 in connection_read (s=31, cri=0x7f85243f99a0) at connection.c:1453 #3 0x000000000044741b in connection_read_thread (ctx=0x7f85243f99f0, argv=0x1f) at connection.c:1260 #4 0x00007f9ecd406bed in ldap_int_thread_pool_wrapper (xpool=0xac8080) at tpool.c:1059 #5 0x00007f9ecca89c02 in start_thread () from /lib64/libc.so.6 #6 0x00007f9eccb0ec40 in clone3 () from /lib64/libc.so.6 No core file now.
Thanks,
Suresh
On Tue, Mar 4, 2025 at 6:12 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Mon, Jan 13, 2025 at 10:42:58AM -0500, Suresh Veliveli wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full".
I
built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level
enabled.
Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Hi Suresh, as I mentioned, you want to see what the provider was doing with the session and the decisions it took along the way. To see that, you want to find where the session starts (where you find this "cookie=rid=..." message) and *then* use the "conn=xxx op=yyy" you find in this message to isolate the messages that correlate with it. That's the first thing you'll need to track down what eventually happened to the session.
If it's related to the crash in any way, it might also show us if something went wrong if we're lucky.
Also just out of interest, are there any Abandon/Cancel requests in the logs?
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
On Tue, Mar 04, 2025 at 08:48:46AM -0500, Suresh Veliveli wrote:
Hi Ondřej,
Thanks for getting back. I do have the logs from a previous replication stall. I'll capture the logs again next time it happens. I checked the logs. I don't see any abandoned connections.
[...]
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 *Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=**20250102015911.702871Z#000000#* *000#000000*
*Nothing about rid=152 is logged after the above.*
Hi Suresh, I am not asking you to look for rid=152, I am asking to use that message to find the correct connid=x op=y pair (e.g. conn=1035 op=1 here) that you *then* can use to find all messages related to that session which is what we are *actually* interested in.
None of the other information is likely to be useful without you doing this and we won't make any progress until that's been gathered...
Thanks,
Ok. Here it is from Feb 27th. Logs from the master.
*Feb 27 11:59:53 *aaa-prod-master-1 slapd[155194]: *conn=3685651 *op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165948.748550Z#000000#000#000000 *Feb 27 23:06:05* aaa-prod-master-1 slapd[693294]: *conn=8478 *op=1 syncprov_op_search: got a persistent search with a cookie=rid=129,csn=20250227165948.748550Z#000000#000#000000 Feb 27 23:07:07 aaa-prod-master-1 slapd[693294]: conn=8478 op=1 syncprov_search_response: cookie=rid=129,csn=20250228040540.047268Z#000000#000#000000
Nothing gets logged from Feb 27 11:59 to Feb 27 23:06 (this is when the replica is restarted)
*Logs for conn=3685651:*
Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165924.610754Z#000000#000#000000 Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=jc3689,ou=people,dc=georgetown,dc=edu Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165924.624981Z#000000#000#000000 Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=jc3689,ou=people,dc=georgetown,dc=edu Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendinfo: sending a new cookie=rid=132,csn=20250227165924.651294Z#000000#000#000000 Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendinfo: sending a new cookie=rid=132,csn=20250227165924.664995Z#000000#000#000000 Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165924.685466Z#000000#000#000000 Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=dt784,ou=people,dc=georgetown,dc=edu Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165924.697980Z#000000#000#000000 Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=dt784,ou=people,dc=georgetown,dc=edu Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165944.116976Z#000000#000#000000 Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=aa3184,ou=people,dc=georgetown,dc=edu Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165948.741563Z#000000#000#000000 Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=mr2086,ou=people,dc=georgetown,dc=edu Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165948.748550Z#000000#000#000000 Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=mr2086,ou=people,dc=georgetown,dc=edu
The above is the last entry in logs for conn=3685651.
On Tue, Mar 4, 2025 at 9:44 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Tue, Mar 04, 2025 at 08:48:46AM -0500, Suresh Veliveli wrote:
Hi Ondřej,
Thanks for getting back. I do have the logs from a previous replication stall. I'll capture the logs again next time it happens. I checked the logs. I don't see any abandoned connections.
[...]
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 *Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=**20250102015911.702871Z#000000#* *000#000000*
*Nothing about rid=152 is logged after the above.*
Hi Suresh, I am not asking you to look for rid=152, I am asking to use that message to find the correct connid=x op=y pair (e.g. conn=1035 op=1 here) that you *then* can use to find all messages related to that session which is what we are *actually* interested in.
None of the other information is likely to be useful without you doing this and we won't make any progress until that's been gathered...
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
On Tue, Mar 04, 2025 at 10:23:44AM -0500, Suresh Veliveli wrote:
Ok. Here it is from Feb 27th. Logs from the master.
*Feb 27 11:59:53 *aaa-prod-master-1 slapd[155194]: *conn=3685651 *op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165948.748550Z#000000#000#000000 *Feb 27 23:06:05* aaa-prod-master-1 slapd[693294]: *conn=8478 *op=1 syncprov_op_search: got a persistent search with a cookie=rid=129,csn=20250227165948.748550Z#000000#000#000000 Feb 27 23:07:07 aaa-prod-master-1 slapd[693294]: conn=8478 op=1 syncprov_search_response: cookie=rid=129,csn=20250228040540.047268Z#000000#000#000000
Nothing gets logged from Feb 27 11:59 to Feb 27 23:06 (this is when the replica is restarted)
Hi Suresh, thanks, even though it's quite strange because that's normal traffic and then abruptly nothing. Does anything else ("syncprov" related) get logged after that message at all?
Thanks,
*Not for rid=129.* Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165948.741563Z#000000#000#000000 *Feb 27 11:59:53* aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165948.748550Z#000000#000#000000
*Feb 27 23:07:07* aaa-prod-master-1 slapd[693294]: conn=8478 op=1 syncprov_sendresp: cookie=rid=129,csn=20250228040634.365834Z#000000#000#000000
*For other replicas:* Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1129 op=1 syncprov_sendresp: cookie=rid=245,csn=20250227165924.192557Z#000000#000#000000 Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1129 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=mbw86,ou=people,dc=georget own,dc=edu Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=3554242 op=1 syncprov_sendresp: cookie=rid=143,csn=20250227165918.446243Z#000000#000#000 000 Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=3554242 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=res128,ou=people,dc=geo rgetown,dc=edu Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1155 op=1 syncprov_sendresp: cookie=rid=247,csn=20250227165921.534510Z#000000#000#000000 Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1155 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=jh2526,ou=people,dc=george town,dc=edu Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1127 op=1 syncprov_sendresp: cookie=rid=644,csn=20250227165917.751851Z#000000#000#000000 Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1127 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=mpssim,ou=people,dc=george town,dc=edu
Thanks, Suresh
On Tue, Mar 4, 2025 at 10:57 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Tue, Mar 04, 2025 at 10:23:44AM -0500, Suresh Veliveli wrote:
Ok. Here it is from Feb 27th. Logs from the master.
*Feb 27 11:59:53 *aaa-prod-master-1 slapd[155194]: *conn=3685651 *op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165948.748550Z#000000#000#000000 *Feb 27 23:06:05* aaa-prod-master-1 slapd[693294]: *conn=8478 *op=1 syncprov_op_search: got a persistent search with a cookie=rid=129,csn=20250227165948.748550Z#000000#000#000000 Feb 27 23:07:07 aaa-prod-master-1 slapd[693294]: conn=8478 op=1 syncprov_search_response: cookie=rid=129,csn=20250228040540.047268Z#000000#000#000000
Nothing gets logged from Feb 27 11:59 to Feb 27 23:06 (this is when the replica is restarted)
Hi Suresh, thanks, even though it's quite strange because that's normal traffic and then abruptly nothing. Does anything else ("syncprov" related) get logged after that message at all?
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
On Tue, Mar 04, 2025 at 11:27:21AM -0500, Suresh Veliveli wrote:
Not for rid=129.
Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165948.741563Z#000000#000#000000 Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1 syncprov_sendresp: cookie=rid=129,csn=20250227165948.748550Z#000000#000#000000
Feb 27 23:07:07 aaa-prod-master-1 slapd[693294]: conn=8478 op=1 syncprov_sendresp: cookie=rid=129,csn=20250228040634.365834Z#000000#000#000000
For other replicas: Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1129 op=1 syncprov_sendresp: cookie=rid=245,csn=20250227165924.192557Z#000000#000#000000 Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1129 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=mbw86,ou=people,dc=georgetown,dc=edu Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=3554242 op=1 syncprov_sendresp: cookie=rid=143,csn=20250227165918.446243Z#000000#000#000000 Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=3554242 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=res128,ou=people,dc=georgetown,dc=edu Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1155 op=1 syncprov_sendresp: cookie=rid=247,csn=20250227165921.534510Z#000000#000#000000 Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1155 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=jh2526,ou=people,dc=georgetown,dc=edu Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1127 op=1 syncprov_sendresp: cookie=rid=644,csn=20250227165917.751851Z#000000#000#000000 Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1127 op=1 syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=mpssim,ou=people,dc=georgetown,dc=edu
I can't see any errors or anything interesting in here, but is there anything newer than 20250227165948.748550Z#000000#000#000000 that's sent to the other replicas before you restarted the provider - is rid=129 the only one that got stuck? Can't find the relevant bits in the configuration you've posted so far, are you sure the replication identity has their limits (time and size limits) set to unlimited?
Thanks,
Yes. New changes have been sent to other replicas. Only rid=129 is stuck until restart; after that, it catches up. Limits are set to unlimited for the replication dn.
limits dn.exact="uid=syncrepladmin,ou=RepAdmin,dc=georgetown,dc=edu" time.soft=unlimited time.hard=unlimited size.soft=unlimited size.hard=unlimited
Thanks, Suresh
On Wed, Mar 5, 2025 at 5:41 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Tue, Mar 04, 2025 at 11:27:21AM -0500, Suresh Veliveli wrote:
Not for rid=129.
Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1
syncprov_sendresp: cookie=rid=129,csn=20250227165948.741563Z#000000#000#000000
Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1
syncprov_sendresp: cookie=rid=129,csn=20250227165948.748550Z#000000#000#000000
Feb 27 23:07:07 aaa-prod-master-1 slapd[693294]: conn=8478 op=1
syncprov_sendresp: cookie=rid=129,csn=20250228040634.365834Z#000000#000#000000
For other replicas: Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1129 op=1
syncprov_sendresp: cookie=rid=245,csn=20250227165924.192557Z#000000#000#000000
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1129 op=1
syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=mbw86,ou=people,dc=georgetown,dc=edu
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=3554242 op=1
syncprov_sendresp: cookie=rid=143,csn=20250227165918.446243Z#000000#000#000000
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=3554242 op=1
syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=res128,ou=people,dc=georgetown,dc=edu
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1155 op=1
syncprov_sendresp: cookie=rid=247,csn=20250227165921.534510Z#000000#000#000000
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1155 op=1
syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=jh2526,ou=people,dc=georgetown,dc=edu
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1127 op=1
syncprov_sendresp: cookie=rid=644,csn=20250227165917.751851Z#000000#000#000000
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1127 op=1
syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=mpssim,ou=people,dc=georgetown,dc=edu
I can't see any errors or anything interesting in here, but is there anything newer than 20250227165948.748550Z#000000#000#000000 that's sent to the other replicas before you restarted the provider - is rid=129 the only one that got stuck? Can't find the relevant bits in the configuration you've posted so far, are you sure the replication identity has their limits (time and size limits) set to unlimited?
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
In another instance, this time, the replica restart crashed the master.
Logs from the master: Mar 8 12:55:51 aaa-prod-master-1 slapd[209555]: conn=1081 op=1 syncprov_op_search: got a persistent search with a cookie=rid=512,csn=20250307175925.825777Z#000000#000#000000 Mar 8 12:56:39 aaa-prod-master-1 slapd[209555]: conn=1081 op=1 syncprov_search_response: cookie=rid=512,csn=20250308175550.008761Z#000000#000#000000 Mar 8 12:56:39 aaa-prod-master-1 slapd[209555]: conn=1081 op=1 syncprov_sendinfo: refreshPresent cookie=rid=512,csn=20250308175550.008761Z#000000#000#000000 Mar 8 12:56:39 aaa-prod-master-1 slapd[209555]: conn=1081 op=1 syncprov_sendresp: cookie=rid=512,csn=20250308175615.323137Z#000000#000#000000 Mar 8 12:56:39 aaa-prod-master-1 slapd[209555]: conn=1081 op=1 syncprov_sendresp: cookie=rid=512,csn=20250308175625.704652Z#000000#000#000000 Mar 8 12:56:39 aaa-prod-master-1 slapd[209555]: conn=1081 op=1 syncprov_sendresp: cookie=rid=512,csn=20250308175633.255673Z#000000#000#0000
Mar 8 12:55:20 aaa-prod-master-1 kernel: slapd[7650]: segfault at 7f5875e68420 ip 0000000000445cfa sp 00007f5f71dfb560 error 4 in slapd[419000+1b9000] likely on CPU 1 (core 1, socket 0)
Mar 8 12:55:25 aaa-prod-master-1 systemd-coredump[209442]: Process 5036 (slapd) of user 3003 dumped core.#012#012Stack trace of thread 7650:#012#0 0x0000000000445cfa connection_abandon (/var/services/openldap/libexec/slapd + 0x45cfa)#012#1 0x00000000004460d5 connection_closing (/var/services/openldap/libexec/slapd + 0x460d5)#012#2 0x0000000000447d18 connection_read (/var/services/openldap/libexec/slapd + 0x47d18)#012#3 0x000000000044741b connection_read_thread (/var/services/openldap/libexec/slapd + 0x4741b)#012#4 0x00007f792e2f0bed n/a (n/a + 0x0)#012#5 0x00007f792d889d22 n/a (n/a + 0x0)#012ELF object binary architecture: AMD x86-64
Regards, Suresh
On Wed, Mar 5, 2025 at 8:09 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Yes. New changes have been sent to other replicas. Only rid=129 is stuck until restart; after that, it catches up. Limits are set to unlimited for the replication dn.
limits dn.exact="uid=syncrepladmin,ou=RepAdmin,dc=georgetown,dc=edu" time.soft=unlimited time.hard=unlimited size.soft=unlimited size.hard=unlimited
Thanks, Suresh
On Wed, Mar 5, 2025 at 5:41 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Tue, Mar 04, 2025 at 11:27:21AM -0500, Suresh Veliveli wrote:
Not for rid=129.
Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1
syncprov_sendresp: cookie=rid=129,csn=20250227165948.741563Z#000000#000#000000
Feb 27 11:59:53 aaa-prod-master-1 slapd[155194]: conn=3685651 op=1
syncprov_sendresp: cookie=rid=129,csn=20250227165948.748550Z#000000#000#000000
Feb 27 23:07:07 aaa-prod-master-1 slapd[693294]: conn=8478 op=1
syncprov_sendresp: cookie=rid=129,csn=20250228040634.365834Z#000000#000#000000
For other replicas: Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1129 op=1
syncprov_sendresp: cookie=rid=245,csn=20250227165924.192557Z#000000#000#000000
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1129 op=1
syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=mbw86,ou=people,dc=georgetown,dc=edu
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=3554242 op=1
syncprov_sendresp: cookie=rid=143,csn=20250227165918.446243Z#000000#000#000000
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=3554242 op=1
syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=res128,ou=people,dc=georgetown,dc=edu
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1155 op=1
syncprov_sendresp: cookie=rid=247,csn=20250227165921.534510Z#000000#000#000000
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1155 op=1
syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=jh2526,ou=people,dc=georgetown,dc=edu
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1127 op=1
syncprov_sendresp: cookie=rid=644,csn=20250227165917.751851Z#000000#000#000000
Feb 27 12:00:00 aaa-prod-master-1 slapd[155194]: conn=1127 op=1
syncprov_sendresp: sending LDAP_SYNC_MODIFY, dn=uid=mpssim,ou=people,dc=georgetown,dc=edu
I can't see any errors or anything interesting in here, but is there anything newer than 20250227165948.748550Z#000000#000#000000 that's sent to the other replicas before you restarted the provider - is rid=129 the only one that got stuck? Can't find the relevant bits in the configuration you've posted so far, are you sure the replication identity has their limits (time and size limits) set to unlimited?
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
openldap-technical@openldap.org