This is another instance where replication stops.
Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060050.594465Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=xxxx,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=xxxxx,ou=People,dc=georgetown,dc=edu (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_message_to_entry: rid=143 DN: uid=xxxxx,ou=People,dc=georgetown,dc=edu, UUID: db41232e-527d-103f-995c-4344349468b5 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060104.603554Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=yyyyy,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=yyyy,ou=People,dc=georgetown,dc=edu (0)
No "syncrep_" messages were logged after the above.
Regards, Suresh
On Mon, Dec 23, 2024 at 2:42 PM Quanah Gibson-Mount quanah@fast-mail.org wrote:
--On Friday, December 20, 2024 5:46 PM -0500 Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
I do have these but no syncrepl_sync.
Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_message_to_entry: rid=129 DN: uid=xxxx,ou=xxxx,dc=georgetown,dc=edu, UUID: 37387976-3ae0-103f-94fc-3b4be3361dc7 Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_entry: rid=129 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241212135921.652395Z#000000#000#000000 tid 0x7f671b7fe640
Yes sorry, that was a typo on my end.
"syncrepl_" sync messages. Specifically you want to see the ones that have "be_" in them, but not "be_search". I.e., be_add, be_modify, etc. They show the result of the operation, something like:
syncrepl_entry: rid=100 be_modify <DN> (<result code>)
--Quanah
Any thoughts?
Thanks, suresh
On Tue, Dec 24, 2024 at 10:41 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
This is another instance where replication stops.
Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060050.594465Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=xxxx,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=xxxxx,ou=People,dc=georgetown,dc=edu (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_message_to_entry: rid=143 DN: uid=xxxxx,ou=People,dc=georgetown,dc=edu, UUID: db41232e-527d-103f-995c-4344349468b5 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060104.603554Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=yyyyy,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=yyyy,ou=People,dc=georgetown,dc=edu (0)
No "syncrep_" messages were logged after the above.
Regards, Suresh
On Mon, Dec 23, 2024 at 2:42 PM Quanah Gibson-Mount quanah@fast-mail.org wrote:
--On Friday, December 20, 2024 5:46 PM -0500 Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
I do have these but no syncrepl_sync.
Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_message_to_entry: rid=129 DN: uid=xxxx,ou=xxxx,dc=georgetown,dc=edu, UUID: 37387976-3ae0-103f-94fc-3b4be3361dc7 Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_entry: rid=129 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241212135921.652395Z#000000#000#000000 tid 0x7f671b7fe640
Yes sorry, that was a typo on my end.
"syncrepl_" sync messages. Specifically you want to see the ones that have "be_" in them, but not "be_search". I.e., be_add, be_modify, etc. They show the result of the operation, something like:
syncrepl_entry: rid=100 be_modify <DN> (<result code>)
--Quanah
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
Another instance where replication is stuck and not recovering.
# requesting: contextCSN contextCSN: *20241229135907.725117Z#000000#000#000000* aaa-prod-aws-10:2636 # requesting: contextCSN contextCSN:* 20241228185913.665451Z#000000#000#000000*
*Log info:* Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: do_syncrep2: rid=650 cookie=rid=650,csn=20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: syncrepl_entry: rid=650 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241228185913.665451Z#000000#000#000000 tid 0x7f26ee5fd640 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000
Nothing gets logged about replication after the above.
Am I missing something?
Regards, Suresh
On Sat, Dec 28, 2024 at 10:05 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts?
Thanks, suresh
On Tue, Dec 24, 2024 at 10:41 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
This is another instance where replication stops.
Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060050.594465Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=xxxx,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=xxxxx,ou=People,dc=georgetown,dc=edu (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_message_to_entry: rid=143 DN: uid=xxxxx,ou=People,dc=georgetown,dc=edu, UUID: db41232e-527d-103f-995c-4344349468b5 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241223060104.603554Z#000000#000#000000 tid 0x7f5aed0fb640 Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_search (0) Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 uid=yyyyy,ou=People,dc=georgetown,dc=edu Dec 23 01:01:23 aaa-prod-aws-3 slapd[1182444]: syncrepl_entry: rid=143 be_modify uid=yyyy,ou=People,dc=georgetown,dc=edu (0)
No "syncrep_" messages were logged after the above.
Regards, Suresh
On Mon, Dec 23, 2024 at 2:42 PM Quanah Gibson-Mount quanah@fast-mail.org wrote:
--On Friday, December 20, 2024 5:46 PM -0500 Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
I do have these but no syncrepl_sync.
Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_message_to_entry: rid=129 DN: uid=xxxx,ou=xxxx,dc=georgetown,dc=edu, UUID: 37387976-3ae0-103f-94fc-3b4be3361dc7 Dec 12 08:59:30 aaa-prod-gcp-9 slapd[4165]: syncrepl_entry: rid=129 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241212135921.652395Z#000000#000#000000 tid 0x7f671b7fe640
Yes sorry, that was a typo on my end.
"syncrepl_" sync messages. Specifically you want to see the ones that have "be_" in them, but not "be_search". I.e., be_add, be_modify, etc. They show the result of the operation, something like:
syncrepl_entry: rid=100 be_modify <DN> (<result code>)
--Quanah
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
On Sun, Dec 29, 2024 at 01:28:42PM -0500, Suresh Veliveli wrote:
Another instance where replication is stuck and not recovering.
# requesting: contextCSN contextCSN: *20241229135907.725117Z#000000#000#000000* aaa-prod-aws-10:2636 # requesting: contextCSN contextCSN:* 20241228185913.665451Z#000000#000#000000*
*Log info:* Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: do_syncrep2: rid=650 cookie=rid=650,csn=20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: syncrepl_entry: rid=650 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241228185913.665451Z#000000#000#000000 tid 0x7f26ee5fd640 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000
Nothing gets logged about replication after the above.
Am I missing something?
Hi Suresh, anything in the provider logs around that time? All consumers messages will be tagged with a specific "conn=xxx op=yyy" which you can discover e.g. by looking for the cookie it sends at the beginning of the session.
Couple of other questions: - is the TCP connection alive as far as the OS is concerned (I see in the thread you've confirmed TCP keepalive is enabled, correct?) - could you post the cn=monitor info for the consumer? The objectclass to look for is olmSyncReplInstance
Thanks,
Another instance: Yes, TCP keepalive is enabled.
aaa-prod-aws-7:1636 # requesting: contextCSN *contextCSN: 20250101065905.147164Z#000000#000#000000*
aaa-prod-aws-7:2636 # requesting: contextCSN contextCSN: 20250102140005.217756Z#000000#000#000000
dn: cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor objectClass: olmSyncReplInstance cn: Consumer 147
*Consumer logs:*
Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: do_syncrep2: rid=147 cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000 Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: syncrepl_entry: rid=147 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20250101065905.147164Z#000000#000#000000 tid 0x7f8e4a5fd640 Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: slap_queue_csn: queueing 0x7f8e40152360 20250101065905.147164Z#000000#000#000000 Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: slap_graduate_commit_csn: removing 0x7f8e40152360 20250101065905.147164Z#000000#000#000000 Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: slap_queue_csn: queueing 0x7f8e4036c6c0 20250101065905.147164Z#000000#000#000000 Jan 1 01:59:12 aaa-prod-aws-7 slapd[1161089]: slap_graduate_commit_csn: removing 0x7f8e4036c6c0 20250101065905.147164Z#000000#000#000000
(Nothing after the above is logged regarding replication)
*Master:*
Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.124585Z#000000#000#000000 Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000
(Nothing after the above for rid=147)
*After restarting Consumer:*
Jan 2 09:29:54 aaa-prod-aws-7 slapd[2750000]: slap_get_csn: conn=-1 op=0 generated new csn=20250102142954.929120Z#000000#000#000000 manage= 0 Jan 2 09:29:55 aaa-prod-aws-7 slapd[2750000]: do_syncrep1: rid=147 starting refresh (sending cookie=rid=147,csn=20250101065905.147164Z#0000 00#000#000000) Jan 2 09:29:59 aaa-prod-aws-7 slapd[2750000]: syncrepl_entry: rid=147 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) csn=(none) tid 0x7f3288dfd640 Jan 2 09:29:59 aaa-prod-aws-7 slapd[2750000]: syncrepl_entry: rid=147 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) csn=(none) tid 0x7f3288dfd640
Thanks, Suresh
On Thu, Jan 2, 2025 at 6:34 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Sun, Dec 29, 2024 at 01:28:42PM -0500, Suresh Veliveli wrote:
Another instance where replication is stuck and not recovering.
# requesting: contextCSN contextCSN: *20241229135907.725117Z#000000#000#000000* aaa-prod-aws-10:2636 # requesting: contextCSN contextCSN:* 20241228185913.665451Z#000000#000#000000*
*Log info:* Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: do_syncrep2: rid=650 cookie=rid=650,csn=20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: syncrepl_entry: rid=650 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20241228185913.665451Z#000000#000#000000 tid 0x7f26ee5fd640 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0dcee50 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_queue_csn: queueing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000 Dec 28 13:59:21 aaa-prod-aws-10 slapd[1161864]: slap_graduate_commit_csn: removing 0x7f26e0f34360 20241228185913.665451Z#000000#000#000000
Nothing gets logged about replication after the above.
Am I missing something?
Hi Suresh, anything in the provider logs around that time? All consumers messages will be tagged with a specific "conn=xxx op=yyy" which you can discover e.g. by looking for the cookie it sends at the beginning of the session.
Couple of other questions:
- is the TCP connection alive as far as the OS is concerned (I see in the thread you've confirmed TCP keepalive is enabled, correct?)
- could you post the cn=monitor info for the consumer? The objectclass to look for is olmSyncReplInstance
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
On Thu, Jan 02, 2025 at 09:39:34AM -0500, Suresh Veliveli wrote:
Another instance: Yes, TCP keepalive is enabled.
So is the TCP connection still open from the point of both servers? See in netstat or ss.
aaa-prod-aws-7:1636 # requesting: contextCSN *contextCSN: 20250101065905.147164Z#000000#000#000000*
aaa-prod-aws-7:2636 # requesting: contextCSN contextCSN: 20250102140005.217756Z#000000#000#000000
dn: cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor objectClass: olmSyncReplInstance cn: Consumer 147
All the data in cn=monitor is contained in the operational attributes, as such, you'll have to request them either by name specifically, objectClass ('@olmSyncReplInstance') or blanket '+', maybe also '*' if you want regular attributes as well.
*Consumer logs:*
[...]
(Nothing after the above is logged regarding replication)
*Master:*
Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.124585Z#000000#000#000000 Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000 (Nothing after the above for rid=147)
This gives you the string to search for: searching for "conn=1034 op=1" here would give you the messages related to the replication session above. You'll see what happens on the provider and correlate that with what the consumer. For every new consumer session there will be a new "conn=xxx op=yyy" to search for.
Regards,
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
All the relevant logs and info:
dn: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor structuralObjectClass: olmSyncReplInstance creatorsName: modifiersName: createTimestamp: 20241209130653Z modifyTimestamp: 20241209130653Z olmSRProviderURIList: ldaps://aaa-master-1.uis.georgetown.edu:636/ olmSRConnection: IP=172.20.86.12:49880 olmSRSyncPhase: Persist olmSRNextConnect: 00000101000000Z olmSRLastConnect: 20241229203510Z olmSRLastContact: 20250102015934Z olmSRLastCookieRcvd: rid=152,csn=*20250102015911.702871Z#000000#000#000000* olmSRLastCookieSent: rid=152,csn=20241229202835.459483Z#000000#000#000000 entryDN: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor subschemaSubentry: cn=Subschema hasSubordinates: FALSE
*Consumer:* netstat -an | grep 49880 tcp 0 0 172.20.86.12:49880 172.17.21.52:636 ESTABLISHED
*Master:* netstat -an | grep 172.20.86.12 tcp 0 0 172.17.21.52:636 172.20.86.12:49880 ESTABLISHED
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
*Consumer logs:* Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: do_syncrep2: rid=152 cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: syncrepl_entry: rid=152 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20250102015911.702871Z#000000#000#000000 tid 0x7f7a753fc640 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000
Nothing about replication is logged after the above.
Thanks, Suresh
On Thu, Jan 2, 2025 at 10:08 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 09:39:34AM -0500, Suresh Veliveli wrote:
Another instance: Yes, TCP keepalive is enabled.
So is the TCP connection still open from the point of both servers? See in netstat or ss.
aaa-prod-aws-7:1636 # requesting: contextCSN *contextCSN: 20250101065905.147164Z#000000#000#000000*
aaa-prod-aws-7:2636 # requesting: contextCSN contextCSN: 20250102140005.217756Z#000000#000#000000
dn: cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor objectClass: olmSyncReplInstance cn: Consumer 147
All the data in cn=monitor is contained in the operational attributes, as such, you'll have to request them either by name specifically, objectClass ('@olmSyncReplInstance') or blanket '+', maybe also '*' if you want regular attributes as well.
*Consumer logs:*
[...]
(Nothing after the above is logged regarding replication)
*Master:*
Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.124585Z#000000#000#000000 Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000 (Nothing after the above for rid=147)
This gives you the string to search for: searching for "conn=1034 op=1" here would give you the messages related to the replication session above. You'll see what happens on the provider and correlate that with what the consumer. For every new consumer session there will be a new "conn=xxx op=yyy" to search for.
Regards,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
The OS is Rock9, AWS EC2 instance.
On Thu, Jan 2, 2025 at 10:32 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
All the relevant logs and info:
dn: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor structuralObjectClass: olmSyncReplInstance creatorsName: modifiersName: createTimestamp: 20241209130653Z modifyTimestamp: 20241209130653Z olmSRProviderURIList: ldaps://aaa-master-1.uis.georgetown.edu:636/ olmSRConnection: IP=172.20.86.12:49880 olmSRSyncPhase: Persist olmSRNextConnect: 00000101000000Z olmSRLastConnect: 20241229203510Z olmSRLastContact: 20250102015934Z olmSRLastCookieRcvd: rid=152,csn= *20250102015911.702871Z#000000#000#000000* olmSRLastCookieSent: rid=152,csn=20241229202835.459483Z#000000#000#000000 entryDN: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor subschemaSubentry: cn=Subschema hasSubordinates: FALSE
*Consumer:* netstat -an | grep 49880 tcp 0 0 172.20.86.12:49880 172.17.21.52:636 ESTABLISHED
*Master:* netstat -an | grep 172.20.86.12 tcp 0 0 172.17.21.52:636 172.20.86.12:49880 ESTABLISHED
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
*Consumer logs:* Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: do_syncrep2: rid=152 cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: syncrepl_entry: rid=152 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20250102015911.702871Z#000000#000#000000 tid 0x7f7a753fc640 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000 Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000
Nothing about replication is logged after the above.
Thanks, Suresh
On Thu, Jan 2, 2025 at 10:08 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 09:39:34AM -0500, Suresh Veliveli wrote:
Another instance: Yes, TCP keepalive is enabled.
So is the TCP connection still open from the point of both servers? See in netstat or ss.
aaa-prod-aws-7:1636 # requesting: contextCSN *contextCSN: 20250101065905.147164Z#000000#000#000000*
aaa-prod-aws-7:2636 # requesting: contextCSN contextCSN: 20250102140005.217756Z#000000#000#000000
dn: cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor objectClass: olmSyncReplInstance cn: Consumer 147
All the data in cn=monitor is contained in the operational attributes, as such, you'll have to request them either by name specifically, objectClass ('@olmSyncReplInstance') or blanket '+', maybe also '*' if you want regular attributes as well.
*Consumer logs:*
[...]
(Nothing after the above is logged regarding replication)
*Master:*
Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.124585Z#000000#000#000000 Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000 (Nothing after the above for rid=147)
This gives you the string to search for: searching for "conn=1034 op=1" here would give you the messages related to the replication session above. You'll see what happens on the provider and correlate that with what the consumer. For every new consumer session there will be a new "conn=xxx op=yyy" to search for.
Regards,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
--On Friday, January 3, 2025 7:52 AM -0500 Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
The OS is Rock9, AWS EC2 instance.
Question - Was this working on an earlier OpenLDAP version, and just started having issues with 2.6.9? I was re-reading your initial post and that wasn't clear.
Thanks!
--Quanah
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli <Suresh.Veliveli@georgetown.edu mailto:Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli <Suresh.Veliveli@georgetown.edu mailto:Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org http://openldap.org/). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník <ondra@mistotebe.net mailto:ondra@mistotebe.net> wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com http://www.symas.com/ Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
The host is an aws ec2 instance.
Regards, Suresh
On Thu, Jan 16, 2025 at 8:44 PM ogg@sr375.com wrote:
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
yes... check your interface stats...
On Jan 16, 2025, at 6:40 PM, Suresh Veliveli Suresh.Veliveli@georgetown.edu wrote:
The host is an aws ec2 instance.
Regards, Suresh
On Thu, Jan 16, 2025 at 8:44 PM <ogg@sr375.com mailto:ogg@sr375.com> wrote:
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli <Suresh.Veliveli@georgetown.edu mailto:Suresh.Veliveli@georgetown.edu> wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli <Suresh.Veliveli@georgetown.edu mailto:Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli <Suresh.Veliveli@georgetown.edu mailto:Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org http://openldap.org/). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník <ondra@mistotebe.net mailto:ondra@mistotebe.net> wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote: > This is another instance where the replication stops. > > aaa-prod-aws-12:1636 > # requesting: contextCSN > contextCSN: *20250102015911.702871Z#000000#000#000000* > > *Master logs:* > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000 > > Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com http://www.symas.com/ Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
I will do it, but why is the master crashing when restarting a stalled replica?
Thanks, Suresh
On Thu, Jan 16, 2025 at 9:58 PM ogg@sr375.com wrote:
yes... check your interface stats...
On Jan 16, 2025, at 6:40 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
The host is an aws ec2 instance.
Regards, Suresh
On Thu, Jan 16, 2025 at 8:44 PM ogg@sr375.com wrote:
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
This is another instance where the replication stops.
aaa-prod-aws-12:1636 # requesting: contextCSN contextCSN: *20250102015911.702871Z#000000#000#000000*
*Master logs:* Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
I have not experienced the master crashing but I have also experienced stalled replication regularly over the last 2 years. We’ve been on 2.46 and have keep alive enabled. We were hoping an upgrade would resolve the problem but it seems to exist on latest versions as well. I am also on aws ec2 instances.
On Thu, Jan 16, 2025 at 8:02 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
I will do it, but why is the master crashing when restarting a stalled replica?
Thanks, Suresh
On Thu, Jan 16, 2025 at 9:58 PM ogg@sr375.com wrote:
yes... check your interface stats...
On Jan 16, 2025, at 6:40 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
The host is an aws ec2 instance.
Regards, Suresh
On Thu, Jan 16, 2025 at 8:44 PM ogg@sr375.com wrote:
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote: > This is another instance where the replication stops. > > aaa-prod-aws-12:1636 > # requesting: contextCSN > contextCSN: *20250102015911.702871Z#000000#000#000000* > > *Master logs:* > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000 > > Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
--On Friday, January 17, 2025 10:51 AM -0700 Thomas Pritchard pritchardtw@gmail.com wrote:
I have not experienced the master crashing but I have also experienced stalled replication regularly over the last 2 years. We've been on 2.46 and have keep alive enabled. We were hoping an upgrade would resolve the problem but it seems to exist on latest versions as well. I am also on aws ec2 instances.
I've been running OpenLDAP in AWS on EC2 instances for the last 3 years at two different companies, and have never encountered this behavior.
--Quanah
My sysadmin says nothing is amiss with the interface.
Regards, Suresh
On Thu, Jan 16, 2025 at 10:02 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
I will do it, but why is the master crashing when restarting a stalled replica?
Thanks, Suresh
On Thu, Jan 16, 2025 at 9:58 PM ogg@sr375.com wrote:
yes... check your interface stats...
On Jan 16, 2025, at 6:40 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
The host is an aws ec2 instance.
Regards, Suresh
On Thu, Jan 16, 2025 at 8:44 PM ogg@sr375.com wrote:
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Had another crash. Attached is the log from " thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli < Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the file from the last crash for "thread apply all bt full". I built it from the src (openldap.org). The installation is prefixed to /var/services/openldap directory. I do have "stats sync" log level enabled. Our logs are huge, I could get the necessary info if you can tell what I need to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote: > This is another instance where the replication stops. > > aaa-prod-aws-12:1636 > # requesting: contextCSN > contextCSN: *20250102015911.702871Z#000000#000#000000* > > *Master logs:* > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000 > Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 > syncprov_sendresp: > cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000 > > Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work) <trace_output.txt>
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
openldap-technical@openldap.org