Fwd: multiple sequential lmdb readers + spinning media = slow / thrashes?

Matthew Moskewicz

26 Feb 2015

7:50 p.m.

warnings: new to list, first post, lmdb noob.

i'm a caffe user: https://github.com/BVLC/caffe

in one use case, caffe sequentially streams through >100GB lmdbs at a rate of ~30MB/s in blocks of about 40MB. however, if multiple caffe processes are reading the same lmdb (opened with MDB_RDONLY), read performance becomes limiting (i.e. the processes become IO bound), even though the disk has sufficient read bandwidth (say ~180MB/s). some of the relevant caffe lmdb code is here:

https://github.com/BVLC/caffe/blob/master/src/caffe/util/db.cpp

however, if i *both*

1) run blockdev --setra 65536 --setfra 65536 /dev/sdwhatever
2) modify lmdb to call posix_madvise(env->me_map, env->me_mapsize, POSIX_MADV_SEQUENTIAL);

then i can get >1 reader to run without being IO limited.

for (2), see https://github.com/moskewcz/scratch/tree/lmdb_seq_read_opt

similarly, using a sequential read microbenchmark designed to model the caffe reads from here: https://github.com/moskewcz/boda/blob/master/src/lmdbif.cc

if i run one reader, i get 180MB/s bandwidth. with two readers, but neither (1) nor (2) above, each gets ~30MB/s bandwidth. with (1) and (2) enabled, and two readers, each gets ~90MB/s bandwidth.

any advice?

mwm

PS: backstory (skippable): caffe originally used LevelDB to get better read performance for sequentially loading sets of ~1M 227x227x3 raw images (~200GB data). typically processing time is ~2 hours for this data set size, yielding a read BW need of 30MB/s or so. it's not really clear if/why LevelDB was used aside from the fact that the caffe author was a google intern at the time he wrote it, but anecdotally i think the claim is that reading the raw .jpgs had perf. issues, although it's unclear exactly what or why. i guess it was the usual story about not getting sequential reads without using LevelDB. they switched to lmdb a while back.

openldap-devel@openldap.org

Howard Chu

26 Feb

8:46 p.m.

Matthew Moskewicz wrote:

> warnings: new to list, first post, lmdb noob.
>
> i'm a caffe user: https://github.com/BVLC/caffe
>
> in one use case, caffe sequentially streams through >100GB lmdbs at a rate of ~30MB/s in blocks of about 40MB. however, if multiple caffe processes are reading the same lmdb (opened with MDB_RDONLY), read performance becomes limiting (i.e. the processes become IO bound), even though the disk has sufficient read bandwidth (say ~180MB/s). some of the relevant caffe lmdb code is here:
>
> https://github.com/BVLC/caffe/blob/master/src/caffe/util/db.cpp
>
> however, if i *both*
>
> 1) run blockdev --setra 65536 --setfra 65536 /dev/sdwhatever
> 2) modify lmdb to call posix_madvise(env->me_map, env->me_mapsize, POSIX_MADV_SEQUENTIAL);
>
> then i can get >1 reader to run without being IO limited.

This is quite timing-dependent - if you start your multiple readers at exactly the same time and they run at exactly the same speed, then they will all be using the same cached pages and all of the readers can run at the full bandwidth of the disk. If they're staggered or not running in lockstep, then you'll only get partial performance.

> for (2), see https://github.com/moskewcz/scratch/tree/lmdb_seq_read_opt
>
> similarly, using a sequential read microbenchmark designed to model the caffe reads from here: https://github.com/moskewcz/boda/blob/master/src/lmdbif.cc
>
> if i run one reader, i get 180MB/s bandwidth. with two readers, but neither (1) nor (2) above, each gets ~30MB/s bandwidth. with (1) and (2) enabled, and two readers, each gets ~90MB/s bandwidth.

The other point to note is that sequential reads in LMDB won't remain truly sequential (as seen by the storage device) after a few rounds of inserts/deletes/updates. Once you get any element of seek/random I/O in here your madvise will be useless.

--
  -- Howard Chu
  CTO, Symas Corp. http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP http://www.openldap.org/project/

Matthew Moskewicz

11:36 p.m.

On Thu, Feb 26, 2015 at 3:46 PM, Howard Chu <hyc@symas.com> wrote:

> Matthew Moskewicz wrote:
>> warnings: new to list, first post, lmdb noob.
>>
>> [snip]
>>
>> https://github.com/BVLC/caffe/blob/master/src/caffe/util/db.cpp
>>
>> however, if i *both*
>>
>> 1) run blockdev --setra 65536 --setfra 65536 /dev/sdwhatever
>> 2) modify lmdb to call posix_madvise(env->me_map, env->me_mapsize, POSIX_MADV_SEQUENTIAL);
>>
>> then i can get >1 reader to run without being IO limited.
>
> This is quite timing-dependent - if you start your multiple readers at exactly the same time and they run at exactly the same speed, then they will all be using the same cached pages and all of the readers can run at the full bandwidth of the disk. If they're staggered or not running in lockstep, then you'll only get partial performance.

thanks for the quick reply. to clarify: yes, this is indeed the case. when/if the readers are reading 'near' each other (within cache size) there is no issue, but over time they drift out of sync, and this is the case i'm considering / when i'm having an issue. these are long-running processes that loop over the entire 200GB lmdb many times over days, at around 2 hours per epoch (iteration over all data).

when i say i can get >1 reader to be not IO limited with my changes, i mean that things continue to work (not be IO limited) even as the readers go out of sync. the processes happen to output enough information, in terms of the lmdb offset at which they are reading, to deduce when they have de-synced by more than the amount of system memory. empirically: without my changes, for a particular 2 readers case, the readers would reliably drop out of sync within a few hours and slow down by at least ~2X (getting perhaps ~20MB/s bandwidth); with the changes i've had 2 runs going for multiple days without issue.

for my microbenchmarking i simulate the out-of-sync-ness and take care to ensure i'm not reading cached areas, either by flushing the caches or by just carefully choosing offsets into a 200GB lmdb on a machine with only 32GB ram. i'd prefer to 'clear the cache' for all tests, but that doesn't actually seem possible when there is a running process that has the entire lmdb mmap()'d. that is, i don't know of any method to make the kernel drop the clean cached mmap()'d pages out of memory. but, caveats aside, i'm claiming that:

a) with the patch+readahead i get full read perf, even when the readers are out of sync / streaming through well-separated (i.e. by more than the size of system memory) parts of the lmdb.
b) without them i see much reduced read performance (presumably due to seek thrashing), sufficient to cause the caffe processes in question to slow down by > 2X.

>> for (2), see https://github.com/moskewcz/scratch/tree/lmdb_seq_read_opt
>>
>> similarly, using a sequential read microbenchmark designed to model the caffe reads from here: https://github.com/moskewcz/boda/blob/master/src/lmdbif.cc
>>
>> if i run one reader, i get 180MB/s bandwidth. with two readers, but neither (1) nor (2) above, each gets ~30MB/s bandwidth. with (1) and (2) enabled, and two readers, each gets ~90MB/s bandwidth.
>
> The other point to note is that sequential reads in LMDB won't remain truly sequential (as seen by the storage device) after a few rounds of inserts/deletes/updates. Once you get any element of seek/random I/O in here your madvise will be useless.

yes, makes sense. i should have noted that, in this use model, the lmdbs are in-order-write-once and then read-only thereafter -- they are created and used in this manner specifically to allow for sequential reads. i'd assume this is not actually reliable in general due to the potential for filesystem-level fragmentation, but i guess in practice it's okay. often, these lmdbs are being written to spinners that are 'fresh' and don't have much filesystem level churn.

mwm

Milosz Tanski

8:50 p.m.

New subject: multiple sequential lmdb readers + spinning media = slow / thrashes?

Matthew,

If you are talking about rotational media, the more readers you add the worse your aggregate bandwidth is going to be... Since LMDB is storing it as a btree, the readers have to do random access, which turns into a lot of seeks. Seek time ends up being amortized as a higher average time to read a block / page and your aggregate bandwidth disappears.

If you have enough memory to store most of the data, or your working set is only a small subset of that data, this won't be as visible.
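The amortization argument above can be put into a rough back-of-the-envelope model: if the disk has to seek once per contiguous chunk it reads, aggregate throughput is the chunk size divided by (seek time + transfer time). This is a deliberately crude sketch, not a measurement. `effective_bandwidth` is a hypothetical helper, and the 10 ms seek time and 128 KiB default readahead window are assumptions; only the 180MB/s streaming rate and the 65536-sector readahead setting come from this thread (`blockdev --setra` counts 512-byte sectors, so 65536 sectors is 32 MiB).

```c
#include <assert.h>

/* Hypothetical model: aggregate throughput in bytes/s when each chunk of
 * chunk_bytes costs one seek of seek_s seconds and is then streamed at
 * stream_bps bytes/s. */
double effective_bandwidth(double chunk_bytes, double seek_s, double stream_bps)
{
    assert(chunk_bytes > 0.0 && seek_s >= 0.0 && stream_bps > 0.0);
    return chunk_bytes / (seek_s + chunk_bytes / stream_bps);
}
```

Under those assumptions, `effective_bandwidth(128.0 * 1024, 0.010, 180e6)` comes out at roughly 12 MB/s, while `effective_bandwidth(65536.0 * 512, 0.010, 180e6)` comes out at roughly 171 MB/s. The model does not reproduce the measured ~30MB/s per reader without the changes (the kernel's real readahead behaviour is more adaptive than one fixed chunk per seek), but it is consistent with the observation that two readers at ~90MB/s each come close to the disk's full streaming rate once the readahead window is large.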

Best, - Milosz

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016
p: 646-253-9055
e: milosz@adfin.com
