The recent work on expanding dynamic group functionality in the dynlist overlay seems to have been
a bad idea. It makes an already fairly complex overlay even more complicated, and it puts a lot more
work into the read side of operations, which adds up to quite noticeable slowdowns in search performance
on deployments that make heavy use of dynamic groups with large memberships.
It appears that in these situations, the approach used in the autogroup overlay (which has been
in contrib since 2007) is better. It moves all of the membership management into the write side
of operations, updating memberships whenever dynamic group definitions are modified, or when
member entries are written/updated/etc. As such, it allows dynamically defined group memberships
to be read/searched at full speed, as if they were static groups. The search performance difference
between autogroup and the dynlist approach is pretty drastic when large groups and large numbers
of groups are in use. Given that, I believe autogroup should be the recommended approach going
forward, and we should move it from contrib into the core code.
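Configuring it is straightforward; a minimal setup looks something like this (a sketch based on
the contrib slapo-autogroup docs; adjust the objectclass and attribute names to your own schema):

    # slapd.conf fragment: load the contrib module and attach the overlay
    moduleload  autogroup.la
    overlay     autogroup
    # keep the static member attribute in sync with each groupOfURLs
    # entry's memberURL search
    autogroup-attrset groupOfURLs memberURL member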
For similar reasons, dynamically managing memberOf attributes in dynlist also has a major
impact on search performance. And as alluded to before, adding that functionality to dynlist has
made the code quite a bit more complex. Again for performance reasons, it's better to manage
this attribute on the write side, updating a real attribute in the DB when groups and memberships
are modified, instead of doing the lookup work on the read/search side. As such, we should
reverse the decision to deprecate the memberof overlay. There were a couple of problems that the
overlay previously presented in replication environments, which prompted us to deprecate it, but I
believe those problems have now been resolved. The first one, referenced in ITS#7400, had to do
with the actual memberof attributes getting replicated even though they weren't meant to be.
The solution for that should have been to exclude memberof from replication using "exattrs" in the
syncrepl consumer config. The exattrs option wasn't behaving as intended in the past, and using it
would cause data desyncs, but that problem was fixed long ago.
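For example, a consumer config along these lines keeps memberof out of replication (a sketch only;
the provider URL, DNs, and credentials are obviously placeholders):

    syncrepl rid=001
      provider=ldap://provider.example.com
      type=refreshAndPersist
      searchbase="dc=example,dc=com"
      bindmethod=simple
      binddn="cn=replicator,dc=example,dc=com"
      credentials=secret
      retry="30 +"
      # don't pull the provider's maintained memberOf values
      exattrs=memberOf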
The other problem was simply due to the lack of ordering guarantees in syncrepl refreshes,
which prevented the memberof overlay from updating a member's memberof attribute if the group
entry got replicated before the member entry. That problem has been solved in ITS#10167 by simply
adding a check, whenever new entries are added, to see if they're already members of any existing groups.
As such, the memberof overlay should be perfectly fine for use in replication scenarios now.
More testing of those scenarios is welcome.
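For anyone testing, the provider-side setup is just the usual overlay config, e.g. (the
objectclass and attribute names shown here are the slapo-memberof defaults):

    overlay memberof
    memberof-group-oc    groupOfNames
    memberof-member-ad   member
    memberof-memberof-ad memberOf
    memberof-refint      TRUE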
With statically managed member and memberof attributes, the major hits to search performance have
been removed. We're still handling nesting of groups dynamically, though, which is what the
new nestgroup overlay (ITS#10161) is for. Again, we need testing to see how much performance impact
remains with the much-reduced overhead on the read/search side of things. The config is also a
lot easier to understand as its own overlay, instead of shoe-horned into the dynlist config.
Testing and feedback appreciated.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
If anyone remembers fsync-gate (https://danluu.com/fsyncgate/), which showed a lot of vulnerabilities
in other popular DBMSs: some further research was later published on the topic as well, at
https://www.usenix.org/conference/atc20/presentation/rebello
I originally discussed this on Twitter back in 2020 but wanted to summarize it again here.
As usual with these types of reports, there are a lot of flaws in their test methodology,
which invalidates some of their conclusions.
In particular, I question the validity of the failure scenarios their CuttleFS simulator produces.
Specifically, they claim that multiple systems exhibit False Failures: fsync reports a failure, but
the write actually (partially) succeeded. In the case of LMDB, where a 1-page synchronous write is involved,
this is just an invalid test.
They assume that the relevant sector that LMDB cares about is successfully written, but an I/O error
occurs on some other sector in the page. And so while LMDB invalidates the commit in memory, a cache
flush and subsequent page-in will read the updated sector. But in the real world, if there are hard
I/O errors on these other sectors, they will most likely also be unreadable, and a subsequent page-in
will also fail. So at least for LMDB, there would be no false failure.
The failure modes they're modeling don't reflect reality.
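Just to spell out what a commit hinging on a single synchronous page write looks like, here's a
rough sketch in C (conceptual only, not LMDB's actual code; the function name, page size, and
error handling are made up for illustration):

    /* Conceptual sketch only, not LMDB's actual commit path. */
    #define _XOPEN_SOURCE 500
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096  /* assumed page size for this sketch */

    /* Write one meta page synchronously; the commit only counts as
     * durable if both the write and the flush report success. */
    static int commit_meta_page(int fd, const void *page, off_t offset)
    {
        ssize_t n = pwrite(fd, page, PAGE_SIZE, offset);
        if (n != (ssize_t)PAGE_SIZE) {
            fprintf(stderr, "commit aborted: meta page write failed\n");
            return -1;      /* invalidate the commit in memory */
        }
        if (fsync(fd) != 0) {
            /* fsync reported failure: invalidate the commit in memory.
             * Nothing on this page is trusted, whether or not some of
             * its sectors actually reached the medium. */
            fprintf(stderr, "commit aborted: %s\n", strerror(errno));
            return -1;
        }
        return 0;   /* durable only when both calls succeeded */
    }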
Leaving that issue aside, there's also the point that modern storage devices are now using 4KB sectors,
and still guarantee atomic sector writes, so the partial success scenario they describe can't even happen.
This is a bunch of academic speculation, with a total absence of real world modeling to validate the
failure scenarios they presented.
The other failures they report, on ext4fs with journaled data, are certainly disturbing. But we always
recommend turning that journaling off with LMDB; it's redundant with LMDB's own COW strategy and harms
perf for no benefit.
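For illustration only (the device and mount point are placeholders; check the current LMDB/Symas
deployment docs for the exact recommendation), making sure data journaling is off on the LMDB
volume is just a mount option:

    # ext4 defaults to data=ordered (no data journaling); data=journal is
    # the mode where those failures were reported. Explicitly selecting
    # writeback also skips data journaling:
    mount -o data=writeback /dev/sdXN /var/lib/ldap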
Of course, you don't even need to trust the filesystem, you can just use LMDB on a raw block device.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/