On Tue, Sep 07, 2010 at 05:09:07AM -0700, Howard Chu wrote:
> We've talked about doing this isolation in the first refresh upon slapd startup. That might still be a good idea.
It would certainly help to keep the apparent promises made by things like the uniqueness overlay. Alternatively you could take the view that the data will converge eventually and that is all that the LDAP standards promise.
> But that reminds me of the joys of being a Sun sysadmin back in the 1980s, when Sun's boot scripts always started their NFS client before starting their NFS server. If two machines cross-mounted each other's filesystems and both were booting at the same time they would hang, each waiting for the other's NFS server to respond to their mount request.
I remember that - one of the many reasons for switching to automounters (along with their own set of problems)... The alternative was 'soft' mounts, which may be a better model for solving the mirrormode problem.
> Mirrormode and multimaster bootstrapping becomes a lot harder if you implement this type of isolation during startup refresh.
I was originally going to suggest that servers should not listen for connections until the first refresh completes, but that would indeed cause the deadlock you describe. How about having master servers listen on an extra port which is used purely for replication and *is* available immediately? The main LDAP port would thus remain closed until the server is synced up, making it much easier for load-balancers to do the right thing. [This should really be a separate discussion as it is not directly related to the bug.]
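To make the difference concrete, here is a toy model in Python of two mirrors booting at the same time. It is only a sketch of the idea, not slapd behaviour: the Node class, its methods and boot_pair() are hypothetical names invented for this illustration, and a "refresh" is reduced to asking whether the peer is reachable for replication.

```python
class Node:
    """Toy model of one mirror (hypothetical, not slapd code)."""

    def __init__(self, name, separate_repl_port):
        self.name = name
        self.separate_repl_port = separate_repl_port
        self.synced = False   # True once the first refresh has completed
        self.peer = None

    def accepts_replication(self):
        # With a dedicated replication port the node is reachable by its
        # peer immediately; otherwise only once it has opened its one port.
        return self.separate_repl_port or self.synced

    def accepts_clients(self):
        # The main LDAP port stays closed until the first refresh is done.
        return self.synced

    def try_refresh(self):
        # The first refresh can only succeed if the peer answers.
        if not self.synced and self.peer.accepts_replication():
            self.synced = True
        return self.synced


def boot_pair(separate_repl_port, rounds=3):
    """Boot two mirrors together; report whether each ends up serving clients."""
    a = Node("a", separate_repl_port)
    b = Node("b", separate_repl_port)
    a.peer, b.peer = b, a
    for _ in range(rounds):
        a.try_refresh()
        b.try_refresh()
    return a.accepts_clients(), b.accepts_clients()


print(boot_pair(separate_repl_port=False))  # each waits for the other
print(boot_pair(separate_repl_port=True))   # both come up
```

With a single port gated on the first refresh, neither node ever opens up, which is the NFS cross-mount hang again. With the extra replication port both complete their refresh and only then start serving clients.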
> Doing it on every refresh seems far more problematic, because without some type of multi-version concurrency control, that means making the server non-responsive until the refresh completes.
That may not be a problem with refresh-and-persist, as in normal circumstances I would expect updates to arrive at the consumer in the same order they hit the supplier (so this bug could not trigger). More difficult for scheduled refresh mode though. Could the consumer server simply write-lock every entry involved in the refresh while it processes the list, and then commit the whole lot in one DB transaction?
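As a rough illustration of the "lock, apply, commit once" idea, here is a sketch in Python using SQLite as a stand-in for the backend database. Everything in it is an assumption made for the example: the entries table, the open_store(), apply_refresh() and dns() helpers, and the UNIQUE constraint on uid (which only loosely stands in for the uniqueness overlay). Note also that SQLite takes one database-wide write lock rather than locking individual entries, so this shows the all-or-nothing commit but not the finer-grained locking.

```python
import sqlite3


def open_store():
    # isolation_level=None: no implicit transactions, so BEGIN/COMMIT
    # below are issued by hand.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE entries (dn TEXT PRIMARY KEY, uid TEXT UNIQUE)")
    return conn


def apply_refresh(conn, changes):
    """Apply a refresh batch of (op, dn, uid) tuples atomically.

    op is "put" or "delete". Returns True if the whole batch was
    committed, False if it was rolled back.
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before any change
    try:
        for op, dn, uid in changes:
            conn.execute("DELETE FROM entries WHERE dn = ?", (dn,))
            if op == "put":
                conn.execute(
                    "INSERT INTO entries (dn, uid) VALUES (?, ?)", (dn, uid)
                )
        conn.execute("COMMIT")  # readers see the whole batch at once
        return True
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # readers see none of the batch
        return False


def dns(conn):
    return [row[0] for row in conn.execute("SELECT dn FROM entries ORDER BY dn")]


store = open_store()
apply_refresh(store, [("put", "cn=a", "x")])
# Delete cn=a and reuse its uid for cn=b in one batch.
apply_refresh(store, [("delete", "cn=a", None), ("put", "cn=b", "x")])
# This batch fails on the second entry, so cn=c must not appear either.
apply_refresh(store, [("put", "cn=c", "y"), ("put", "cn=d", "x")])
print(dns(store))
```

Other connections never see a half-applied batch here, but writers are blocked for the duration of the refresh, which is the cost of doing it this way.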
Andrew