Hello,

To start with, this is my very first post on a mailing list of this kind.

I've read up on how to post properly, for example here: https://www.openldap.org/doc/admin24/troubleshooting.html and here: https://www.openldap.org/faq/data/cache/59.html

Nonetheless, please do tell me if I'm not following the standards.

I'm facing an issue with my OpenLDAP servers that I struggle to fix or even find info about.

To summarize:

We have 2 providers using mirror mode replication for their configuration. They are LXC containers hosted in a datacenter.

We also have 2 consumers, likewise using mirror mode replication for their configuration, which get the DB (mdb backend) from the providers above. They are VMs hosted on a different hypervisor, in a different datacenter from the providers.

A load balancer in front distributes requests evenly across the 4 servers. Write requests that reach the consumers are forwarded to the providers by the chain overlay.

The slapd process on those 2 consumers randomly hangs. When it happens, the slapd service is still running and listening, but not a single query gets through, and even the logging stops, like a complete freeze. The service comes back to life after some time (several minutes, different every time) without any manual intervention. Trying to restart the process manually during such an event fails unless I use kill -9, which is not something I want to do.

My observations so far:

- The hangs appear only during business hours (nothing at night or on weekends).

- Only the consumers hang. The providers are always fine.

- CPU / RAM / disk usage is fine before and during the hangs.

- The hangs seem random and I could not find a way to trigger them at will.

- strace shows the following lines repeating in an endless loop during hangs:

```
[pid 1082] gettimeofday({tv_sec=1726669191, tv_usec=694473}, NULL) = 0
[pid 1082] poll([{fd=18, events=POLLIN|POLLPRI}], 1, 100) = 0 (Timeout)
[pid 1082] sched_yield()
```

It seems that the process is waiting for some data that never comes.
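To check whether the loop always targets the same descriptor, a saved strace capture can be summarized with a small script. This is only a minimal sketch: it assumes the capture uses the same line format as the excerpt above, and the function name `count_poll_timeouts` is made up for illustration.

```python
import re
from collections import Counter

# Matches a single-fd poll() call that returned 0 (timed out), in the
# format shown in the strace excerpt above.
POLL_TIMEOUT = re.compile(
    r"poll\(\[\{fd=(\d+), events=[^}]*\}\], 1, \d+\)\s*=\s*0 \(Timeout\)"
)


def count_poll_timeouts(lines):
    """Return a Counter mapping fd -> number of timed-out poll() calls."""
    counts = Counter()
    for line in lines:
        match = POLL_TIMEOUT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Feeding it the lines of a capture file (for example an open file object) returns the number of timeouts per fd; a single fd dominating the count would be consistent with the process waiting on one connection.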

fd 18 refers to a TCP socket connected to provider1, which would suggest that the provider is not sending what the consumer is waiting for.

This has led me to several hypotheses:
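For anyone who wants to repeat the fd lookup, here is a rough sketch of one way to map a descriptor to its TCP endpoints through /proc on Linux. It assumes IPv4 only (IPv6 sockets are listed in a separate `tcp6` table) and a little-endian host; the helper names are hypothetical, and tools such as lsof or ss give the same information.

```python
import os
import re
import socket
import struct


def decode_ipv4(hex_addr):
    """Decode an 'IP:PORT' field from /proc/net/tcp, e.g. '0100007F:0185'."""
    ip_hex, port_hex = hex_addr.split(":")
    # Assumption: little-endian host, where the kernel prints the 32-bit
    # address in host byte order.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def socket_inode(pid, fd):
    """Return the socket inode behind /proc/<pid>/fd/<fd>, or None."""
    target = os.readlink(f"/proc/{pid}/fd/{fd}")
    match = re.fullmatch(r"socket:\[(\d+)\]", target)
    return int(match.group(1)) if match else None


def find_endpoints(inode, table_lines):
    """Return ((local_ip, port), (remote_ip, port)) for a socket inode."""
    for line in table_lines:
        fields = line.split()
        # After splitting on whitespace, the inode is the 10th field.
        if len(fields) > 9 and fields[9] == str(inode):
            return decode_ipv4(fields[1]), decode_ipv4(fields[2])
    return None
```

Run as root on the consumer, something like `find_endpoints(socket_inode(1082, 18), open("/proc/1082/net/tcp"))` should return the local and remote address of the socket, using the pid and fd from the strace excerpt.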

1. Network issues

Unlikely, since only those 2 servers in the DC are experiencing misbehavior, and I can telnet / ldapsearch from the faulty consumer to the provider with no problem during hangs.

2. Syncrepl / slapo-chain misconfiguration

I might have a configuration problem, but no matter how much I review it, I don't see what's wrong. I've provided the syncrepl / slapo-chain conf in the config.ldif attachment.

I've also provided several other attachments that I thought would be helpful:

- log.txt => The slapd logs in trace level that show the last bind before the service hangs.

- ldaps_request.log => An ldapsearch performed against the faulty consumer during a hang.

- starttls_request.log => Same as above, but using StartTLS => the connection appears to succeed, then stalls at the TLS server hello exchange.

Versions:

- slapd: 2.5.18 LTS

- Server: Ubuntu 22.04 LTS

I am running out of ideas for things to try to fix this, hence this email.

I thank you for your time and I am very grateful for any help you could give me.

Regards,

Pierre-Jean