On Thu, Feb 26, 2015 at 3:46 PM, Howard Chu hyc@symas.com wrote:
> Matthew Moskewicz wrote:
>> warnings: new to list, first post, lmdb noob.
>> [snip]
>> https://github.com/BVLC/caffe/blob/master/src/caffe/util/db.cpp
>> however, if i *both*
>> (1) run blockdev --setra 65536 --setfra 65536 /dev/sdwhatever
>> (2) modify lmdb to call posix_madvise(env->me_map, env->me_mapsize,
>> POSIX_MADV_SEQUENTIAL);
>> then i can get >1 reader to run without being IO limited.
>
> This is quite timing-dependent - if you start your multiple readers at exactly the same time and they run at exactly the same speed, then they will all be using the same cached pages and all of the readers can run at the full bandwidth of the disk. If they're staggered or not running in lockstep, then you'll only get partial performance.
thanks for the quick reply. to clarify: yes, this is indeed the case. when the readers are reading 'near' each other (within the cache size) there is no issue, but over time they drift out of sync, and that is the case i'm considering / where i'm having an issue. these are long-running processes that loop over the entire 200GB lmdb many times over days, at around 2 hours per epoch (one iteration over all the data).
when i say i can get >1 reader to be not IO limited with my changes, i mean that things continue to work (i.e. remain not IO limited) even as the readers go out of sync. the processes happen to output enough information to deduce when the lmdb offsets at which they are reading have drifted apart by more than the amount of system memory. empirically: without my changes, for a particular 2-reader case, the readers would reliably drop out of sync within a few hours and slow down by at least ~2X (getting perhaps ~20MB/s bandwidth); with the changes, i've had 2 runs going for multiple days without issue.
for my microbenchmarking i simulate the out-of-sync-ness and take care to ensure i'm not reading cached areas, either by flushing the caches or by carefully choosing offsets into a 200GB lmdb on a machine with only 32GB of ram. i'd prefer to 'clear the cache' for all tests, but that doesn't actually seem possible while a running process has the entire lmdb mmap()'d -- as far as i know, even echo 3 > /proc/sys/vm/drop_caches only frees unmapped clean pagecache, and won't drop clean pages still mapped by a live process. but, caveats aside, i'm claiming that:
a) with the patch+readahead i get full read performance, even when the readers are out of sync / streaming through well-separated (i.e. by more than the size of system memory) parts of the lmdb.
b) without them i see much reduced read performance (presumably due to seek thrashing), sufficient to slow the caffe processes in question down by >2X.
for (2), the lmdb posix_madvise patch, see https://github.com/moskewcz/scratch/tree/lmdb_seq_read_opt
similarly, i use a sequential read microbenchmark designed to model the caffe read pattern, from here: https://github.com/moskewcz/boda/blob/master/src/lmdbif.cc
if i run one reader, i get 180MB/s bandwidth. with two readers, but neither (1) nor (2) above, each gets ~30MB/s bandwidth. with (1) and (2) enabled, and two readers, each gets ~90MB/s bandwidth.
> The other point to note is that sequential reads in LMDB won't remain truly sequential (as seen by the storage device) after a few rounds of inserts/deletes/updates. Once you get any element of seek/random I/O in here your madvise will be useless.
yes, makes sense. i should have noted that, in this use model, the lmdbs are written once, in order, and are read-only thereafter -- they are created and used this way specifically to allow sequential reads. i'd assume this isn't actually reliable in general, due to the potential for filesystem-level fragmentation, but in practice it seems to be okay. often these lmdbs are written to spinners that are 'fresh' and haven't seen much filesystem-level churn.
mwm