Howard Chu wrote:
Andrew Bartlett wrote:
I agree, we might be jumping at different shadows here, but your patches did fix something...
I see what you're describing now, with the kvm set to 2 CPUs. It appears to be a bug caused by the recent patch for connection_hangup() processing. Running slapd with -d15 in your test shows that a connection is closed shortly after being established and becoming readable. The bug is (probably) that we queued the reader but processed the hangup immediately, thus closing the connection before the reader executes. I'm not exactly sure why this causes the problem in your test, since it looks like your client is closing the socket before waiting for the reply. But this is certainly the right area.
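To make the suspected ordering problem concrete, here is a minimal single-threaded sketch of deferring a hangup while a reader is still queued. This is not slapd's code and not necessarily the fix that was applied: the conn struct, its fields, and the conn_* functions are all hypothetical names invented for illustration, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical connection state; NOT slapd's Connection struct. */
typedef struct conn {
    bool open;            /* connection is still usable */
    bool reader_queued;   /* a read task has been queued but has not run yet */
    bool close_deferred;  /* a hangup arrived while the reader was queued */
} conn;

/* Mark that a reader task has been handed to the thread pool. */
void conn_queue_reader(conn *c)
{
    c->reader_queued = true;
}

/* Hangup event: close immediately only if no reader is pending. */
void conn_on_hangup(conn *c)
{
    if (c->reader_queued)
        c->close_deferred = true;   /* let the reader run first */
    else
        c->open = false;
}

/* Reader task: returns true if it ran on a still-open connection. */
bool conn_run_reader(conn *c)
{
    bool ran_on_open = c->open;

    c->reader_queued = false;
    if (c->close_deferred) {
        /* Perform the close that the hangup handler postponed. */
        c->close_deferred = false;
        c->open = false;
    }
    return ran_on_open;
}
```

With this ordering, a hangup that arrives between queuing and running the reader no longer closes the connection underneath it; the close happens once the reader has finished.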
With the connection problem out of the way, the remaining slapd crash appeared to be due to some type of heap corruption, but the usual suite of tools (valgrind, efence, LBER_MEMORY_DEBUG, etc.) never reproduced the problem. On the assumption that this was related to a stack overwrite, I recompiled libldap_r with a 12MB stack size instead of the default 8MB, and that didn't change the outcome. On the assumption that some other race condition was involved, I set slapd to only 2 threads (down from the default of 16) to try to limit the possibilities there. This caused the crash to occur much more quickly, and with libumem it became obvious that refint was accessing already-freed memory.
So it turns out that the patch for rev 1.41 (ITS#5428) to make refint into a global overlay was incorrect. It moved the loop that processed the work queue into a new function so that it could be called multiple times, but it still used that code basically as-is, and that code freed the queue as it walked it. (Originally the queue was only walked once, so this was harmless.) Part of the reason the bug was so hard to reproduce is that it tended to show up only when the queue got fairly long, and that only happened if slapd was too busy. (Thus, forcing only 2 threads made it occur sooner.)
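As an illustration of the general pattern (not the actual refint code), here is a minimal sketch of a queue-draining function that is safe to call more than once. The work_item and work_queue types and the queue_push/queue_drain functions are hypothetical names; the real overlay's structures differ, and a multi-threaded version would also need a mutex around the detach step.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical work item; NOT the real refint data structures. */
typedef struct work_item {
    int id;
    struct work_item *next;
} work_item;

typedef struct work_queue {
    work_item *head;
} work_queue;

/* Push a new item on the front of the queue; returns 0 on success. */
int queue_push(work_queue *q, int id)
{
    work_item *it = malloc(sizeof(*it));

    if (it == NULL)
        return -1;
    it->id = id;
    it->next = q->head;
    q->head = it;
    return 0;
}

/*
 * Process and free every queued item; returns the number processed.
 * The list is detached from the queue first, so q->head never points
 * at freed memory and a later call starts from a clean, empty queue.
 */
int queue_drain(work_queue *q)
{
    work_item *it = q->head;
    int n = 0;

    q->head = NULL;                     /* detach before freeing anything */
    while (it != NULL) {
        work_item *next = it->next;     /* read next before freeing it */
        /* ... the per-item work on it->id would go here ... */
        free(it);
        it = next;
        n++;
    }
    return n;
}
```

A walk-once loop that frees items but leaves the head pointer untouched is fine the first time and reads freed memory on the second call, which matches the symptom described above. Detaching the list up front is one simple way to avoid that.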