The attached patch makes O_DIRECT work on Linux in BerkeleyDB 4.5.20. (You will need to manually define LINUX_NEEDS_PAGE_ALIGNMENT if you're using a kernel older than 2.6.)
The main reason to use this patch is to conserve memory - ordinarily, all the I/O that BDB does to its files gets cached in the Linux filesystem buffer cache. This caching is redundant since BDB always does its own caching, and it effectively makes the BDB environment consume twice as much memory as it needs. Using O_DIRECT on I/Os disables the filesystem buffer cache for those I/Os, thus freeing up a sizable chunk of memory.
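For anyone who hasn't used it, O_DIRECT is just a flag passed to open(2). A minimal sketch of what it looks like on Linux - this is not the actual patch code, and the function name is made up for illustration:

    /* Sketch only: open a file for direct I/O.  With O_DIRECT the kernel
     * moves data straight between the application's buffer and the device,
     * bypassing the page cache entirely. */
    #define _GNU_SOURCE             /* O_DIRECT needs this on Linux/glibc */
    #include <fcntl.h>

    int open_direct(const char *path)
    {
        return open(path, O_RDWR | O_DIRECT);
    }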
The caching problem is particularly aggravated on Linux because the memory manager doesn't give program pages higher priority than cache pages. So when your system is tight on memory, the kernel will start swapping program data pages before it starts reclaiming buffer cache pages, and application performance plummets. (Possibly that indicates a kernel bug, or at least a misfeature.)
Note that you must configure BerkeleyDB with --enable-o_direct to enable the support, and you must add "set_flags DB_DIRECT_DB" to your DB_CONFIG to enable it in a particular environment.
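For reference, the DB_CONFIG line should be equivalent to calling DB_ENV->set_flags() on the environment handle before opening it. A rough sketch, assuming a transactional environment - the wrapper name, open flags, and error handling are only illustrative:

    #include <db.h>

    /* Sketch: create an environment with direct I/O enabled on database files. */
    int open_env_direct(DB_ENV **out, const char *home)
    {
        DB_ENV *dbenv;
        int ret;

        if ((ret = db_env_create(&dbenv, 0)) != 0)
            return ret;
        /* same effect as "set_flags DB_DIRECT_DB" in DB_CONFIG */
        dbenv->set_flags(dbenv, DB_DIRECT_DB, 1);
        ret = dbenv->open(dbenv, home,
            DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK |
            DB_INIT_LOG | DB_INIT_TXN, 0);
        *out = dbenv;
        return ret;
    }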
With this patch, a slapd that occupies 6.8GB on a system with 8GB of RAM can run continuously without swapping, delivering a sustained 11,500 authentications per second. Without the patch, swapping starts when the process hits the 4.5GB mark (because over 3GB of buffer cache is in use), and performance drops to only *hundreds* of authentications per second.
Howard Chu wrote:
> The caching problem is particularly aggravated on Linux because the memory manager doesn't give program pages higher priority than cache pages. So when your system is tight on memory, the kernel will start swapping program data pages before it starts reclaiming buffer cache pages, and application performance plummets. (Possibly that indicates a kernel bug, or at least a misfeature.)
Thanks to Rik van Riel for enlightening me here about /proc/sys/vm/swappiness. The default setting on the system was 60 (range 0-100), but setting it down to 10 reduced the problem considerably. With a setting of 10, only 300MB of the slapd process got swapped out, and for the most part the swap daemons were idle after that. Total throughput is around 11,200 authentications per second. Not quite as fast as the Direct I/O case, but much, much better than before. Some time apparently is still lost to swapping - the swap in use decreases slowly, indicating that the swapped-out data pages are still needed. I suppose running with swappiness=0 would eliminate that; I'll try it after the current swappiness=10 test completes.
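For anyone who wants to repeat the experiment, changing the setting is a one-liner as root (it doesn't persist across reboots unless vm.swappiness is also set in /etc/sysctl.conf):

    echo 10 > /proc/sys/vm/swappiness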
(There's a downside to the O_DIRECT patch - it requires every buffer allocation that BDB makes to be overallocated by 512 or 4096 bytes, so that the buffer can be properly aligned. But it certainly yields the best performance overall.)
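Concretely, the alignment trick amounts to something like the sketch below - not the actual BDB code, and the function name is made up; posix_memalign(3) would accomplish the same thing where it's usable:

    #include <stdint.h>
    #include <stdlib.h>

    /* Sketch: over-allocate by 'align' bytes (512 or 4096, a power of two)
     * and round the pointer up to the next boundary so O_DIRECT accepts it.
     * The caller must free(*raw_out), not the returned pointer. */
    void *alloc_aligned(size_t len, size_t align, void **raw_out)
    {
        void *raw = malloc(len + align);

        if (raw == NULL)
            return NULL;
        *raw_out = raw;
        return (void *)(((uintptr_t)raw + align - 1) & ~(uintptr_t)(align - 1));
    }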
Howard Chu wrote:
> With this patch, a slapd that occupies 6.8GB on a system with 8GB of RAM can run continuously without swapping, delivering a sustained 11,500 authentications per second. Without the patch, swapping starts when the process hits the 4.5GB mark (because over 3GB of buffer cache is in use), and performance drops to only *hundreds* of authentications per second.
This is interesting. Did you test performance under other workloads? The reason I ask is that every time I've tried O_DIRECT in the past, performance suffered (significantly) in the case where I/O is being done (I suspect due to reduced concurrency, because the application must block in cases where it wouldn't have when using OS buffering). Other database products that I keep track of (e.g. PostgreSQL) report similar findings.
David Boreham wrote:
>> With this patch, a slapd that occupies 6.8GB on a system with 8GB of RAM can run continuously without swapping, delivering a sustained 11,500 authentications per second. Without the patch, swapping starts when the process hits the 4.5GB mark (because over 3GB of buffer cache is in use), and performance drops to only *hundreds* of authentications per second.
> This is interesting. Did you test performance under other workloads? The reason I ask is that every time I've tried O_DIRECT in the past, performance suffered (significantly) in the case where I/O is being done (I suspect due to reduced concurrency, because the application must block in cases where it wouldn't have when using OS buffering). Other database products that I keep track of (e.g. PostgreSQL) report similar findings.
Testing with swappiness=0 actually did turn out faster, by a tiny margin. Peak throughput was 11,609 auths/second @ 160 client threads with swappiness=0, vs. 11,567/sec @ 140 client threads with O_DIRECT. Peak process size was also slightly smaller without O_DIRECT.
I think the difference is so small because the caches were already at a 99% hit rate; very few requests would actually need to do I/O. But in those cases where the data wasn't in the slapd or the BDB cache, it had a chance of being in the fs buffer cache, thus the higher throughput without O_DIRECT.
At this point I'm going to forget about the O_DIRECT patch.
Howard Chu wrote:
> I think the difference is so small because the caches were already at a 99% hit rate; very few requests would actually need to do I/O.
Right, that was my concern. When I tried this (on both Linux and NT), I saw performance in the non-100% hit rate case fall significantly.
David Boreham wrote:
> Howard Chu wrote:
>> I think the difference is so small because the caches were already at a 99% hit rate; very few requests would actually need to do I/O.
> Right, that was my concern. When I tried this (on both Linux and NT), I saw performance in the non-100% hit rate case fall significantly.
Hm... Something doesn't sound right here. If the buffer cache is actually mitigating that effect, that implies that you could allocate more space to the BDB cache and get the same benefit. (Indeed, looking at the system status at the end of the test, with 4.2GB in the buffer cache, that tells me that I should have raised our BDB cache to 4.2GB.) The other side of this is that using O_DIRECT eliminates the double-buffering that's occurring, so each I/O that actually needs to be done does one less memcpy.
But again, using O_DIRECT means every I/O is synchronous, so I guess that might wreak havoc on any queuing optimizations in the fs.
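For what it's worth, handing that memory to BDB instead is a one-line change: either "set_cachesize 4 209715200 1" in DB_CONFIG or a call like the sketch below before DB_ENV->open. The 4GB + 200MB figure and the wrapper name are just examples:

    #include <db.h>

    /* Sketch: request a ~4.2GB cache in one contiguous region (raise the last
     * argument to split it into multiple regions if contiguous allocation is
     * a problem).  Must be called before the environment is opened. */
    int set_big_cache(DB_ENV *dbenv)
    {
        return dbenv->set_cachesize(dbenv, 4, 200 * 1024 * 1024, 1);
    }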
Howard Chu wrote:
> But again, using O_DIRECT means every I/O is synchronous, so I guess that might wreak havoc on any queuing optimizations in the fs.
And readahead, if it's working for the application.
In my tests I ran a battery of workloads, not just operations that benefited from a warm, large cache. Some sped up, but others got much worse.