The documentation of MDB_NOSYNC says:
If the filesystem preserves write order and the MDB_WRITEMAP flag is not used, transactions exhibit ACI (atomicity, consistency, isolation) properties and only lose D (durability).
In practice, what file system + options preserve write order?
I asked Howard this question elsewhere and got the answer that ZFS should do it, and ext4 with data=ordered _may_ do it. It seems to me that ext4 with data=journal should be a very safe bet too, would it not? Are there any other recommendations?
I ran a few microbenchmarks to compare ext4 data=ordered and data=journal. With the default sync, they manage about 600 and 400 write txn/s, respectively. With nosync + an mdb_env_sync() every second, they are both at about 200k txn/s. For reference, the system can do about 5 million read txn/s. That makes me hopeful that ext4 with data=journal could be a good option.
Cheers, Gábor Melis
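
The "nosync + an mdb_env_sync() every second" setup from the message above can be sketched roughly as follows. This is not the benchmark that was actually run; it is a minimal illustration that assumes the third-party py-lmdb binding, where `sync=False` corresponds to MDB_NOSYNC and `env.sync(True)` to a forced mdb_env_sync(). The names `PeriodicSync` and `run_writes` are made up for this example.

```python
import time


class PeriodicSync:
    """Invoke sync_fn at most once per `interval` seconds (hypothetical helper)."""

    def __init__(self, sync_fn, interval=1.0, clock=time.monotonic):
        self._sync_fn = sync_fn
        self._interval = interval
        self._clock = clock
        self._last = clock()

    def maybe_sync(self):
        """Sync if the interval has elapsed; return True when a sync happened."""
        now = self._clock()
        if now - self._last >= self._interval:
            self._sync_fn()
            self._last = now
            return True
        return False


def run_writes(path, n_txns, interval=1.0):
    """Commit n_txns small write transactions, flushing about once per interval."""
    import lmdb  # assumption: py-lmdb is installed; it is not in the standard library

    # sync=False opens the environment with MDB_NOSYNC: commits do not flush to disk.
    env = lmdb.open(path, map_size=1 << 30, sync=False)
    # force=True matters here: under MDB_NOSYNC an unforced sync is skipped.
    syncer = PeriodicSync(lambda: env.sync(True), interval)
    try:
        for i in range(n_txns):
            with env.begin(write=True) as txn:
                txn.put(i.to_bytes(8, "big"), b"value")
            syncer.maybe_sync()
        env.sync(True)  # final flush so the last interval's commits reach the disk
    finally:
        env.close()
```

With this arrangement a crash can lose up to roughly one interval of committed transactions. Whether the database stays consistent after such a crash still depends on the filesystem preserving write order, which is the question this thread is about.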
Gábor Melis wrote:
> The documentation of MDB_NOSYNC says:
> If the filesystem preserves write order and the MDB_WRITEMAP flag is not used, transactions exhibit ACI (atomicity, consistency, isolation) properties and only lose D (durability).
> In practice, what file system + options preserve write order?
> I asked Howard this question elsewhere and got the answer that ZFS should do it, and ext4 with data=ordered _may_ do it. It seems to me that ext4 with data=journal should be a very safe bet too, would it not? Are there any other recommendations?
ext4 with data=journal should never be used: it hides unrecoverable fsync errors, as discussed here: https://www.usenix.org/conference/atc20/presentation/rebello (but also see my notes on their work; their testing methods aren't quite right: https://twitter.com/hyc_symas/status/1284689627295682563 )
Also, ext4 with data=journal is just way too slow. For dedicated processing workloads you're better off with LMDB on a raw block device, no filesystem at all.
> I ran a few microbenchmarks to compare ext4 data=ordered and data=journal. With the default sync, they manage about 600 and 400 write txn/s, respectively. With nosync + an mdb_env_sync() every second, they are both at about 200k txn/s. For reference, the system can do about 5 million read txn/s. That makes me hopeful that ext4 with data=journal could be a good option.
It's fine if you don't care about I/O errors.
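
Since the advice above depends on which data= mode an ext4 filesystem is mounted with, a small check along these lines may help. It is only a sketch: it assumes the Linux /proc/mounts text format (device, mount point, type, comma-separated options), and the function name `ext4_data_modes` is made up for this example. Some kernels do not list the data= option when the default mode is in use, so a missing entry is reported as None rather than guessed.

```python
def ext4_data_modes(mounts_text):
    """Map each ext4 mount point to its data= mode, or None if none is listed."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] != "ext4":
            continue
        mode = None
        for opt in fields[3].split(","):
            if opt.startswith("data="):
                mode = opt[len("data="):]
        modes[fields[1]] = mode
    return modes


# Illustrative input only; on a Linux host, pass the contents of /proc/mounts instead.
sample = (
    "/dev/sda1 / ext4 rw,relatime,data=journal 0 0\n"
    "/dev/sdb1 /data ext4 rw,noatime 0 0\n"
    "tmpfs /tmp tmpfs rw 0 0\n"
)
print(ext4_data_modes(sample))
```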
On Tue, 18 Aug 2020 at 19:10, Howard Chu <hyc@symas.com> wrote:
> ext4 with data=journal should never be used: it hides unrecoverable fsync errors, as discussed here: https://www.usenix.org/conference/atc20/presentation/rebello (but also see my notes on their work; their testing methods aren't quite right: https://twitter.com/hyc_symas/status/1284689627295682563 )
> Also, ext4 with data=journal is just way too slow. For dedicated processing workloads you're better off with LMDB on a raw block device, no filesystem at all.
But it seems that raw block devices can reorder writes. Are you suggesting using a raw block device for sync writes?
>> That makes me hopeful that ext4 with data=journal could be a good option.
> It's fine if you don't care about I/O errors.
Thank you. It's rather incredible where file systems are after all these years.
openldap-technical@openldap.org