Hallvard B Furuseth wrote:
This is a quick abstract answer, I haven't dived into the code to see what my suggestions would mean in practice. Anyway:
It sounds to me like this problem cannot be fully prevented when network operations share the same pool queue as everything else, and we have operations that can wait for other operations. All pool threads can be occupied with LDAP-level operations, so that network-level operations which the LDAP-level operations depend on can get blocked. If slapd has a design which even tries to guarantee forward progress, I'm not aware of it.
So LDAP-level operations ought to leave at least one thread free for for network-level operations. Not necessary a designated thread: A thread moving from network-level to LDAP-level operation like connection_read_thread() does could first check that at least one other thread remains available for network-level operation.
Yes...
Beyond that, my immediate reaction was that pauses are implemented at the wrong level and/or need to be split up in different types of pauses. There is no good reason cn=config's need to have slapd config variables for itself to affect network operations - nor, I hope, affect pool-level operations like pool_purgekey().
I suppose implementing things this way can appear to be too blunt. But the amount of locks required to do it at a finer level is, IMO, unmanageable. The fact that you can change global slapd config parameters (such as the number of threads, size of sockbuf buffers, etc.) makes it inherently safe for *anything* else to be active while config changes are being made. And sifting thru each variable to decide how sensitive they are, and when they are unsafe, will inevitably lead to requiring locks on every single piece of configuration data. (Want to look up an attribute type, or objectclass? Or a database suffix? etc. etc. etc... There are countless things we do arbitrarily that simply won't work if we allow arbitrary threads to continue to run while their configurations are changed from under them.)
E.g. cn=config could use a slapd_config_pause() call to ask for lone access to the config variables instead of thread_pool_pause() call. Slapd operations that must respect such pauses should thus call a slapd_config_pausecheck() macro/function, not thread_pool_pausecheck() or depend on the pool to respect slapd-level pauses. connection_read_thread() can then go ahead and use the connection independently of those pauses. If it reaches connection_operation(), it'll need to check for slapd pause.
Yes, I've thought about this approach too. That makes the pause mechanisms even messier.
Another possibility is to just try to read 1 byte in the listener thread, to detect the hangup there when we have no other means to discover it. We would have to make sure to be able to unget this byte back to the bottom of the sockbuf stack if there's valid data. This will affect the throughput of the listener thread, but it may not be too terrible a hit.