I have not experienced the master crashing, but I have regularly experienced stalled replication over the last 2 years. We've been on 2.46 and have keep alive enabled. We were hoping an upgrade would resolve the problem, but it seems to exist on the latest versions as well. I am also on AWS EC2 instances.
On Thu, Jan 16, 2025 at 8:02 PM Suresh Veliveli <Suresh.Veliveli@georgetown.edu> wrote:
I will do it, but why is the master crashing when restarting a stalled replica?
Thanks, Suresh
On Thu, Jan 16, 2025 at 9:58 PM ogg@sr375.com wrote:
yes... check your interface stats...
On Jan 16, 2025, at 6:40 PM, Suresh Veliveli <Suresh.Veliveli@georgetown.edu> wrote:
The host is an AWS EC2 instance.
Regards, Suresh
On Thu, Jan 16, 2025 at 8:44 PM ogg@sr375.com wrote:
Have we verified the connection is error-free and run a memory test on this host? It seems there are issues with a stable connection to the network.
On Jan 16, 2025, at 5:35 PM, Suresh Veliveli <Suresh.Veliveli@georgetown.edu> wrote:
Had another crash. Attached is the log from "thread apply all bt full".
Regards, Suresh
On Thu, Jan 16, 2025 at 7:16 PM Suresh Veliveli <Suresh.Veliveli@georgetown.edu> wrote:
Any thoughts on this?
Regards, Suresh
On Mon, Jan 13, 2025 at 10:42 AM Suresh Veliveli <Suresh.Veliveli@georgetown.edu> wrote:
Hi Ondřej,
Attached is the "thread apply all bt full" output from the last crash. I built it from source (openldap.org). The installation uses /var/services/openldap as its prefix. I do have the "stats sync" log level enabled. Our logs are huge, but I can pull out the necessary info if you can tell me what to look for.
Thanks, Suresh
On Mon, Jan 13, 2025 at 7:31 AM Ondřej Kuzník ondra@mistotebe.net wrote:
On Thu, Jan 02, 2025 at 10:32:23PM -0500, Suresh Veliveli wrote:
> This is another instance where the replication stops.
>
> aaa-prod-aws-12:1636
> # requesting: contextCSN
> contextCSN: *20250102015911.702871Z#000000#000#000000*
>
> *Master logs:*
> Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000
> Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
>
> Nothing about rid=152 is logged after the above
Hi Suresh, you shouldn't be searching for the rid= on the provider. Instead, you might use it to find the relevant "conn=xxx op=yyy" string and then search for that.
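For anyone following along, the two-step search described above could be sketched roughly as follows. This is only an illustration, assuming the provider log has one entry per line in the same shape as the lines quoted earlier ("conn=N op=N ... cookie=rid=N,csn=..."); the function names and the sample data are hypothetical and not part of OpenLDAP.

```python
import re

# Matches the "conn=<n> op=<n>" pair as it appears in the quoted slapd log lines.
CONN_OP = re.compile(r"conn=(\d+) op=(\d+)")


def conn_ops_for_rid(lines, rid):
    """Step 1: collect the (conn, op) pairs from lines that mention rid=<rid>."""
    marker = re.compile(rf"\brid={rid}\b")
    found = set()
    for line in lines:
        if marker.search(line):
            match = CONN_OP.search(line)
            if match:
                found.add((int(match.group(1)), int(match.group(2))))
    return found


def lines_for_conn(lines, conn):
    """Step 2: return every line for that connection, whether or not it has a rid."""
    pattern = re.compile(rf"\bconn={conn}\b")
    return [line for line in lines if pattern.search(line)]


if __name__ == "__main__":
    # Sample lines copied from the log excerpt quoted in this thread.
    sample = [
        "Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 "
        "syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000",
        "Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 "
        "syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000",
    ]
    for conn, op in sorted(conn_ops_for_rid(sample, 152)):
        print(f"rid=152 was seen on conn={conn} op={op}")
        for line in lines_for_conn(sample, conn):
            print(line)
```

On a real log you would read the lines from the file instead of the sample list; the point is that later entries for the same connection (a close, an abandon, an error) usually carry the conn= value but no rid=.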
When you encounter this stall, could you do a 'thread apply all bt full' on the provider?
Given you also reported a crash in the server, where are you getting packages from? Are you sure you are loading all modules from there and not from an old version etc.? Would you be able to attach the provider logs with at least sync+stats log level enabled? You can redact any confidential information as needed.
Thanks,
-- Ondřej Kuzník Senior Software Engineer Symas Corporation http://www.symas.com Packaged, certified, and supported LDAP solutions powered by OpenLDAP
-- Suresh Veliveli Sr. UNIX Systems Engineer Georgetown University University Information Services | Security Infrastructure and Policy-Identity and Collaboration 202-262-6676 (cell) | 202-687-3108 (work)
<trace_output.txt>