Hi Howard, I've now got lmdb working with powerdns in place of kyoto - nice and easy to do, thanks! Maximum DNS query load is a little better - about 10-30% depending on use-case - but for us the main gain is that you can have a writer going at the same time - I was struggling a bit with how to push updates from a different process using kyoto. There are a few issues and things I'd like to comment on though:
1) Can you update the documentation to explain what happens when I do an mdb_cursor_del()? I am assuming it advances the cursor to the next record (this seems to be the behaviour). However, there seems to be some sort of bug with this assumption. Basically I have a loop which jumps (MDB_SET_RANGE) to a key and then deletes records until the key no longer matches a pattern. So I do while(..) { mdb_cursor_del(); mdb_cursor_get(..., MDB_GET_CURRENT); }. This mostly works fine, but roughly 1% of the time I get EINVAL returned when I try to MDB_GET_CURRENT after a delete. This always seems to happen on the same records - I'm not sure about the memory structure, but could it be something to do with hitting a page boundary somehow invalidating the cursor? At the moment I just catch that and then do an MDB_NEXT to skip over them, but this will be an issue for us on live. This is from perl so it /may/ be that, or the version of lmdb that is shipped with it; however, the perl layer is a very thin wrapper and, looking at the code, I can only think it comes from lmdb.
2) Currently, because kyoto cabinet didn't have support for multiple identical keys we don't use the DUP options. This leads to quite long keys (1200-1300 bytes in some cases). In the future, it would be nice to have a run-time keylength specifier or something along those lines.
3) Perhaps an mdb_cursor_get_key() function (like kyoto) which doesn't return the data (just the key). As in (2), we store all the data in the key - not sure how much of a performance difference this would make though
4) Creating a database with non-sequential keys is very bad (on 4gb databases, 2* slower than kyoto - about 1h30 and uses more memory). I spent quite a bit of time looking at this in perl and then C. Basically I create a database, open 1 txn and then insert a bunch of unordered keys. Up to about 500mb it's fine and nice and quick - from perl, about 75k inserts/sec (slow mostly because it's reading from mysql). However, after that first 500mb it starts flushing to disk. In the sequential-insert case the flush is very quick - 100-200mb/sec or so. However, on non-sequential inserts I've seen it drop to like 4 or 5mb/sec as it's writing data all over the disk rather than doing big sequential writes. iostat shows the same ~200tps of writes, 100% usage, but only 4-10mb/sec of bytes being written.
However, even when it's not flushing (or when storing data on SSD or memdisk), after the first 500mb performance massively drops off to perhaps 10-15k inserts/sec. At the same time, looking at `top`, once the resident memory hits about 500mb, the 'shared memory' starts being used and resident just keeps on increasing. I'm not sure if this is some kind of kernel accounting thing to do with mmap usage, but it doesn't happen for sequential key inserts (for those, shared mem stays around 0 and resident stays at 500mb). I'm using centos 6 with various different kernels from the default to 3.7.5 and the behaviour is the same. I don't really know how to go about looking for the root cause of this, but I'm pretty sure that whilst the IO is crippling it in places, there is something else going on, of which the shared memory increase is a sign. I've tried using the WRITEMAP option too, which doesn't seem to affect anything significantly in terms of performance or memory usage.
5) pkgconfig/rpms would be really nice to have. Or do you expect it to just be bundled with a project as eg the perl module does?
Thanks,
Mark
--On Thursday, August 22, 2013 9:59 PM +0300 Mark Zealey spam@markandruth.co.uk wrote:
- pkgconfig/rpms would be really nice to have. Or do you expect it to
just be bundled with a project as eg the perl module does?
Debian and FreeBSD both have packages of the latest release, 0.9.7. But no, the OpenLDAP Foundation never provides distribution-specific packages.
--Quanah
--
Quanah Gibson-Mount
Lead Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
Mark Zealey wrote:
Hi Howard, I've now got lmdb working with powerdns in place of kyoto - nice and easy to do, thanks! Maximum DNS query load is a little better - about 10-30% depending on use-case - but for us the main gain is that you can have a writer going at the same time - I was struggling a bit with how to push updates from a different process using kyoto. There are a few issues and things I'd like to comment on though:
- Can you update the documentation to explain what happens when I do an
mdb_cursor_del()? I am assuming it advances the cursor to the next record (this seems to be the behaviour). However, there seems to be some sort of bug with this assumption. Basically I have a loop which jumps (MDB_SET_RANGE) to a key and then deletes records until the key no longer matches a pattern. So I do while(..) { mdb_cursor_del(); mdb_cursor_get(..., MDB_GET_CURRENT); }. This mostly works fine, but roughly 1% of the time I get EINVAL returned when I try to MDB_GET_CURRENT after a delete. This always seems to happen on the same records - I'm not sure about the memory structure, but could it be something to do with hitting a page boundary somehow invalidating the cursor?
That's exactly what it does, yes.
At the moment I just catch that and then do an MDB_NEXT to skip over them, but this will be an issue for us on live. This is from perl so it /may/ be that, or the version of lmdb that is shipped with it; however, the perl layer is a very thin wrapper and, looking at the code, I can only think it comes from lmdb.
- Currently, because kyoto cabinet didn't have support for multiple
identical keys we don't use the DUP options. This leads to quite long keys (1200-1300 bytes in some cases). In the future, it would be nice to have a run-time keylength specifier or something along those lines.
I don't foresee that ever happening. The max keysize will always be constrained such that two nodes fit on a page. But we've added the get_maxkeysize() function so that in the future we can increase the limit; there's really no technical reason why it needs to be stuck at 511 bytes.
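A minimal sketch of querying that limit at run time - assuming the exported name is mdb_env_get_maxkeysize(), as in recent LMDB headers:

    #include <stdio.h>
    #include "lmdb.h"

    int main(void)
    {
        MDB_env *env;
        if (mdb_env_create(&env) != 0)
            return 1;
        /* the limit is compiled into the library; query it rather
           than hard-coding 511 */
        printf("max key size: %d\n", mdb_env_get_maxkeysize(env));
        mdb_env_close(env);
        return 0;
    }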
- Perhaps an mdb_cursor_get_key() function (like kyoto) which doesn't
return the data (just the key). As in (2) we store all the data in the key - not sure how much of a performance difference this would make though
Two answers: In mdb_cursor_get, the data param can be NULL if you don't want the data. Also, since LMDB is zero-copy, all it's doing is storing a pointer value anyway, so the cost difference of returning the data is pretty much nil.
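So a key-only scan needs nothing new - a sketch, assuming an already-open txn and dbi (error handling elided):

    MDB_cursor *cursor;
    MDB_val key;
    int rc;

    rc = mdb_cursor_open(txn, dbi, &cursor);
    /* pass NULL for the data param; only the key is filled in */
    while ((rc = mdb_cursor_get(cursor, &key, NULL, MDB_NEXT)) == 0)
        printf("%.*s\n", (int)key.mv_size, (char *)key.mv_data);
    mdb_cursor_close(cursor);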
- Creating a database with non-sequential keys is very bad (on 4gb
databases, 2* slower than kyoto - about 1h30 and uses more memory). I spent quite a bit of time looking at this in perl and then C. Basically I create a database, open 1 txn and then insert a bunch of unordered keys. Up to about 500mb it's fine and nice and quick - from perl, about 75k inserts/sec (slow mostly because it's reading from mysql). However, after that first 500mb it starts flushing to disk. In the sequential-insert case the flush is very quick - 100-200mb/sec or so. However, on non-sequential inserts I've seen it drop to like 4 or 5mb/sec as it's writing data all over the disk rather than doing big sequential writes. iostat shows the same ~200tps of writes, 100% usage, but only 4-10mb/sec of bytes being written.
However, even when it's not flushing (or when storing data on SSD or memdisk), after the first 500mb performance massively drops off to perhaps 10-15k inserts/sec. At the same time, looking at `top`, once the resident memory hits about 500mb, the 'shared memory' starts being used and resident just keeps on increasing. I'm not sure if this is some kind of kernel accounting thing to do with mmap usage, but it doesn't happen for sequential key inserts (for those, shared mem stays around 0 and resident stays at 500mb). I'm using centos 6 with various different kernels from the default to 3.7.5 and the behaviour is the same. I don't really know how to go about looking for the root cause of this, but I'm pretty sure that whilst the IO is crippling it in places, there is something else going on, of which the shared memory increase is a sign. I've tried using the WRITEMAP option too, which doesn't seem to affect anything significantly in terms of performance or memory usage.
None of the memory behavior you just described makes any sense to me. LMDB uses a shared memory map, exclusively. All of the memory growth you see in the process should be shared memory. If it's anywhere else then I'm pretty sure you have a memory leak. With all the valgrind sessions we've run I'm also pretty sure that *we* don't have a memory leak.
As for the random I/O, it also seems a bit suspect. Are you doing a commit on every key, or batching multiple keys per commit?
- pkgconfig/rpms would be really nice to have. Or do you expect it to
just be bundled with a project as eg the perl module does?
The OpenLDAP Project releases source code, period. Distros do whatever they do. FreeBSD and Debian have LMDB packages now; if you want RPMs I suggest you ask your distro provider.
On 22/08/13 23:37, Howard Chu wrote:
- Can you update the documentation to explain what happens when I do an
mdb_cursor_del()? I am assuming it advances the cursor to the next record (this seems to be the behaviour). However, there seems to be some sort of bug with this assumption. Basically I have a loop which jumps (MDB_SET_RANGE) to a key and then deletes records until the key no longer matches a pattern. So I do while(..) { mdb_cursor_del(); mdb_cursor_get(..., MDB_GET_CURRENT); }. This mostly works fine, but roughly 1% of the time I get EINVAL returned when I try to MDB_GET_CURRENT after a delete. This always seems to happen on the same records - I'm not sure about the memory structure, but could it be something to do with hitting a page boundary somehow invalidating the cursor?
That's exactly what it does, yes.
Any idea about the EINVAL issue?
None of the memory behavior you just described makes any sense to me. LMDB uses a shared memory map, exclusively. All of the memory growth you see in the process should be shared memory. If it's anywhere else then I'm pretty sure you have a memory leak. With all the valgrind sessions we've run I'm also pretty sure that *we* don't have a memory leak.
As for the random I/O, it also seems a bit suspect. Are you doing a commit on every key, or batching multiple keys per commit?
I'm not doing *any* commits, just one big txn for all the data...
The below C works fine up until i=4m (ie 500mb of resident memory shown in top), then has a massive slowdown; shared memory (again, as seen in top) increases, it waits about 20-30 seconds and then the disks get hammered writing 10mb/sec (200 txns) when they are capable of 100-200mb/sec streaming writes... Does it do the same for you?
#include <stdio.h>
#include <stdlib.h>
#include "lmdb.h"

int main(int argc, char *argv[])
{
    int i = 0, rc;
    MDB_env *env;
    MDB_dbi dbi;
    MDB_val key, data;
    MDB_txn *txn;
    char buf[40];
    int count = 100000000;

    rc = mdb_env_create(&env);
    rc = mdb_env_set_mapsize(env, (size_t)1024*1024*1024*10);
    rc = mdb_env_open(env, "./testdb", 0, 0664);
    rc = mdb_txn_begin(env, NULL, 0, &txn);
    rc = mdb_open(txn, NULL, 0, &dbi);

    for (i = 0; i < count; i++) {
        /* non-sequential keys: a random component, then counters */
        sprintf(buf, "blah foo %9ld%9d%9d",
                (long)(random() * (float)count / RAND_MAX) - i, i, i);
        if (i % 100000 == 0)
            printf("%s\n", buf);
        key.mv_size = sizeof(buf);
        key.mv_data = buf;
        data.mv_size = sizeof(buf);
        data.mv_data = buf;
        rc = mdb_put(txn, dbi, &key, &data, 0);
    }
    rc = mdb_txn_commit(txn);
    mdb_close(env, dbi);
    mdb_env_close(env);
    return 0;
}
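(For reference: the above needs the ./testdb directory to exist beforehand - mdb_env_open doesn't create it - and builds with something like cc -O2 test.c -llmdb -o test.)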
By the way, I've just generated our biggest database (~4.5gb) from scratch using our standard perl script. Using kyoto (treedb) with various tunings it did it in 18 min real time vs lmdb at 50 minutes (both ssd-backed in a box with 24gb free memory).
Mark
Mark Zealey wrote:
On 22/08/13 23:37, Howard Chu wrote:
- Can you update the documentation to explain what happens when I do an
mdb_cursor_del()? I am assuming it advances the cursor to the next record (this seems to be the behaviour). However, there seems to be some sort of bug with this assumption. Basically I have a loop which jumps (MDB_SET_RANGE) to a key and then deletes records until the key no longer matches a pattern. So I do while(..) { mdb_cursor_del(); mdb_cursor_get(..., MDB_GET_CURRENT); }. This mostly works fine, but roughly 1% of the time I get EINVAL returned when I try to MDB_GET_CURRENT after a delete. This always seems to happen on the same records - I'm not sure about the memory structure, but could it be something to do with hitting a page boundary somehow invalidating the cursor?
That's exactly what it does, yes.
Any idea about the EINVAL issue?
Yes, as I said already, it does exactly what you said. When you've deleted the last item on the page, the cursor no longer points at a valid node, so GET_CURRENT returns EINVAL.
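For reference, a minimal sketch of a range-delete loop using the catch-EINVAL-then-MDB_NEXT workaround described above (assumes an open cursor, a char *prefix, and <errno.h>/<string.h>; error handling elided):

    size_t plen = strlen(prefix);
    MDB_val key, data;
    int rc;

    key.mv_size = plen;
    key.mv_data = prefix;
    /* position on the first key >= prefix */
    rc = mdb_cursor_get(cursor, &key, &data, MDB_SET_RANGE);
    while (rc == 0 && key.mv_size >= plen
           && memcmp(key.mv_data, prefix, plen) == 0) {
        rc = mdb_cursor_del(cursor, 0);
        if (rc != 0)
            break;
        /* normally the cursor now sits on the next record */
        rc = mdb_cursor_get(cursor, &key, &data, MDB_GET_CURRENT);
        /* but if the delete emptied its position at the end of a
           page, GET_CURRENT returns EINVAL and we step explicitly */
        if (rc == EINVAL)
            rc = mdb_cursor_get(cursor, &key, &data, MDB_NEXT);
    }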
None of the memory behavior you just described makes any sense to me. LMDB uses a shared memory map, exclusively. All of the memory growth you see in the process should be shared memory. If it's anywhere else then I'm pretty sure you have a memory leak. With all the valgrind sessions we've run I'm also pretty sure that *we* don't have a memory leak.
As for the random I/O, it also seems a bit suspect. Are you doing a commit on every key, or batching multiple keys per commit?
I'm not doing *any* commits, just one big txn for all the data...
The below C works fine up until i=4m (ie 500mb of resident memory shown in top), then has a massive slowdown; shared memory (again, as seen in top) increases, it waits about 20-30 seconds and then the disks get hammered writing 10mb/sec (200 txns) when they are capable of 100-200mb/sec streaming writes... Does it do the same for you?
#include <stdio.h>
#include <stdlib.h>
#include "lmdb.h"

int main(int argc, char *argv[])
{
    int i = 0, rc;
    MDB_env *env;
    MDB_dbi dbi;
    MDB_val key, data;
    MDB_txn *txn;
    char buf[40];
    int count = 100000000;

    rc = mdb_env_create(&env);
    rc = mdb_env_set_mapsize(env, (size_t)1024*1024*1024*10);
    rc = mdb_env_open(env, "./testdb", 0, 0664);
    rc = mdb_txn_begin(env, NULL, 0, &txn);
    rc = mdb_open(txn, NULL, 0, &dbi);

    for (i = 0; i < count; i++) {
        /* non-sequential keys: a random component, then counters */
        sprintf(buf, "blah foo %9ld%9d%9d",
                (long)(random() * (float)count / RAND_MAX) - i, i, i);
        if (i % 100000 == 0)
            printf("%s\n", buf);
        key.mv_size = sizeof(buf);
        key.mv_data = buf;
        data.mv_size = sizeof(buf);
        data.mv_data = buf;
        rc = mdb_put(txn, dbi, &key, &data, 0);
    }
    rc = mdb_txn_commit(txn);
    mdb_close(env, dbi);
    mdb_env_close(env);
    return 0;
}
By the way, I've just generated our biggest database (~4.5gb) from scratch using our standard perl script. Using kyoto (treedb) with various tunings it did it in 18 min real time vs lmdb at 50 minutes (both ssd-backed in a box with 24gb free memory).
Kyoto writes async by default. You should do the same here, use MDB_NOSYNC on the env_open.
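In the test program above that's a one-line change (a sketch; NOSYNC trades crash durability for speed):

    /* don't fsync at commit, like kyoto's default async behaviour */
    rc = mdb_env_open(env, "./testdb", MDB_NOSYNC, 0664);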
On 23/08/13 00:07, Howard Chu wrote:
... as I said already, it does exactly what you said. When you've deleted the last item on the page, the cursor no longer points at a valid node, so GET_CURRENT returns EINVAL.
OK, I think that's a pretty big gotcha - it would be great to either make the cursor behaviour consistent (ie automatically skip to the next record at the end of a page) or to document this very clearly! It's pretty different from the kyoto/bdb interfaces in that (to my mind at least) it requires a bit too much knowledge from the developer about the underlying structure of the database, which should be abstracted by the interface.
I'm not doing *any* commits just one big txn for all the data...
The below C works fine up until i=4m (ie 500mb of resident memory shown in top), then has a massive slowdown; shared memory (again, as seen in top) increases, it waits about 20-30 seconds and then the disks get hammered writing 10mb/sec (200 txns) when they are capable of 100-200mb/sec streaming writes... Does it do the same for you? ...
Kyoto writes async by default. You should do the same here, use MDB_NOSYNC on the env_open.
MDB_NOSYNC makes no difference in my test case above - I'm seeing exactly the same memory, speed and disk patterns. Are you able to reproduce it?
Mark
Mark Zealey wrote:
I'm not doing *any* commits just one big txn for all the data...
The below C works fine up until i=4m (ie 500mb of resident memory shown in top), then has a massive slowdown; shared memory (again, as seen in top) increases, it waits about 20-30 seconds and then the disks get hammered writing 10mb/sec (200 txns) when they are capable of 100-200mb/sec streaming writes... Does it do the same for you? ...
Kyoto writes async by default. You should do the same here, use MDB_NOSYNC on the env_open.
MDB_NOSYNC makes no difference in my test case above - I'm seeing exactly the same memory, speed and disk patterns. Are you able to reproduce it?
Yes, I see it here, and I see the problem. LMDB was not originally designed to handle transactions of unlimited size. It originally had a txn size limit of about 512MB. In 0.9.7 we added some code to raise this limit, and it's performing quite poorly here. I've tweaked my copy of the code to alleviate that problem, but your test program still fails here because the volume of data being written also exceeds the map size. You were able to run this to completion?
Howard Chu wrote:
Mark Zealey wrote:
I'm not doing *any* commits just one big txn for all the data...
The below C works fine up until i=4m (ie 500mb of resident memory shown in top), then has a massive slowdown; shared memory (again, as seen in top) increases, it waits about 20-30 seconds and then the disks get hammered writing 10mb/sec (200 txns) when they are capable of 100-200mb/sec streaming writes... Does it do the same for you? ...
Kyoto writes async by default. You should do the same here, use MDB_NOSYNC on the env_open.
MDB_NOSYNC makes no difference in my test case above - I'm seeing exactly the same memory, speed and disk patterns. Are you able to reproduce it?
Yes, I see it here, and I see the problem. LMDB was not originally designed to handle transactions of unlimited size. It originally had a txn size limit of about 512MB. In 0.9.7 we added some code to raise this limit, and it's performing quite poorly here. I've tweaked my copy of the code to alleviate that problem, but your test program still fails here because the volume of data being written also exceeds the map size. You were able to run this to completion?
Two things... I've committed a patch to mdb.master to help this case out. It sped up my run of your program, using only 10M records, from 19min to 7min.
Additionally, if you change your test program to commit every 2M records, and avoid running into the large txn situation, then the 10M records are stored in only 1m51s.
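The batching change is roughly the following, against the loop in the test program above (a sketch; error handling elided):

    for (i = 0; i < count; i++) {
        /* ... build buf and mdb_put() as before ... */
        if ((i + 1) % 2000000 == 0) {
            /* cap the txn size: commit and start a fresh txn */
            rc = mdb_txn_commit(txn);
            rc = mdb_txn_begin(env, NULL, 0, &txn);
        }
    }
    rc = mdb_txn_commit(txn);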
Running it now with the original 100M count. Will see how it goes.
On 23/08/13 04:55, Howard Chu wrote:
Howard Chu wrote:
Yes, I see it here, and I see the problem. LMDB was not originally designed to handle transactions of unlimited size. It originally had a txn size limit of about 512MB. In 0.9.7 we added some code to raise this limit, and it's performing quite poorly here. I've tweaked my copy of the code to alleviate that problem, but your test program still fails here because the volume of data being written also exceeds the map size. You were able to run this to completion?
Two things... I've committed a patch to mdb.master to help this case out. It sped up my run of your program, using only 10M records, from 19min to 7min.
Additionally, if you change your test program to commit every 2M records, and avoid running into the large txn situation, then the 10M records are stored in only 1m51s.
Running it now with the original 100M count. Will see how it goes.
I never actually ran it through (hence the map size issue); it was more just an unlimited number to investigate the slowdown - 10M seems fine. I just pulled from git (assumed this was better than the patch you sent) and rebuilt; it certainly seems a bit better now, although at around 6m records (ext4) it has some awful IO - it drops to 1mb/sec in places on our normal disk (the first few writes are 100mb/s, then it starts writing all over the place). I've tried on both ext4 and xfs with no special tuning and pretty much the same thing happens, although closer to 7m records on xfs. This is with the NOSYNC option too. If I set the commit gap to 1m records, performance is ok up to around 8.4m records on ext4 and then it just stops for a minute or two doing small writes. The same thing happens at about 9.4m. It seems that the patch has pushed the performance dropoff back a bit and perhaps improved on it, but there is still an issue there as far as I can see.
The test program with 10m records committing every 1m completes in 1m10s user time, but 5m30s real time because of all the pausing for disk writes (ext4, but as above it doesn't seem to make much difference compared to xfs)... The same program & latest git on an SSD-backed system (ie the massive number of small write transactions doesn't cause any issues) with a slightly faster CPU - user time 47sec, real time 1min. On the SSD-backed box without any commits - 5m30s user time, 6min real time.
So committing every 1-2m records is much better. I don't mind using short transactions (in fact the program doesn't actually need any transactions). Perhaps it would be good to have an "allow LMDB to automatically commit+reopen this transaction for optimal performance" flag, or some way of easily knowing when the txn should be committed and reopened, rather than trying to guess roughly how many bytes I've written since the last txn and committing if > a magic number of 400mb?
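In the meantime, the byte-counting guess can at least live in one place - a sketch with hypothetical names (auto_txn/auto_put are illustrative, not LMDB API):

    struct auto_txn {
        MDB_env *env;
        MDB_txn *txn;
        size_t written, limit;  /* e.g. limit = 400UL << 20 */
    };

    static int auto_put(struct auto_txn *at, MDB_dbi dbi,
                        MDB_val *key, MDB_val *data)
    {
        int rc = mdb_put(at->txn, dbi, key, data, 0);
        if (rc != 0)
            return rc;
        at->written += key->mv_size + data->mv_size;
        if (at->written >= at->limit) {
            /* threshold reached: commit and reopen the txn */
            rc = mdb_txn_commit(at->txn);
            if (rc != 0)
                return rc;
            at->written = 0;
            rc = mdb_txn_begin(at->env, NULL, 0, &at->txn);
        }
        return rc;
    }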
Also, I don't know how intentional the 512mb limit you mention is, but perhaps that could be set at runtime - that way I could just set it to half the box's memory size and ensure I don't need to write anything until I have the whole thing generated?
By the way, looking at `free` output seems to imply that `top` is lying about how much memory the program is using - resident looks like it is capped at 500mb but it keeps rising along with shared, which is presumably the pages in the mmap that are in memory at the moment.
wrt the ssd vs hdd performance differences, I did see similar disk write issues with kyoto. So for that we generate onto a memdisk; however, it seems a bit strange to have to do this with LMDB given it's advertised as a memory database.
Mark
Mark Zealey wrote:
On 23/08/13 04:55, Howard Chu wrote:
Howard Chu wrote:
Yes, I see it here, and I see the problem. LMDB was not originally designed to handle transactions of unlimited size. It originally had a txn size limit of about 512MB. In 0.9.7 we added some code to raise this limit, and it's performing quite poorly here. I've tweaked my copy of the code to alleviate that problem, but your test program still fails here because the volume of data being written also exceeds the map size. You were able to run this to completion?
Two things... I've committed a patch to mdb.master to help this case out. It sped up my run of your program, using only 10M records, from 19min to 7min.
Additionally, if you change your test program to commit every 2M records, and avoid running into the large txn situation, then the 10M records are stored in only 1m51s.
Running it now with the original 100M count. Will see how it goes.
I never actually ran it through (hence the map size issue); it was more just an unlimited number to investigate the slowdown - 10M seems fine. I just pulled from git (assumed this was better than the patch you sent) and rebuilt; it certainly seems a bit better now, although at around 6m records (ext4) it has some awful IO - it drops to 1mb/sec in places on our normal disk (the first few writes are 100mb/s, then it starts writing all over the place). I've tried on both ext4 and xfs with no special tuning and pretty much the same thing happens, although closer to 7m records on xfs. This is with the NOSYNC option too. If I set the commit gap to 1m records, performance is ok up to around 8.4m records on ext4 and then it just stops for a minute or two doing small writes. The same thing happens at about 9.4m. It seems that the patch has pushed the performance dropoff back a bit and perhaps improved on it, but there is still an issue there as far as I can see.
Agreed, it's still fairly slow. I reran the 100M using commits at 100,000 and it finished in 18m26s.
The test program with 10m records committing every 1m completes in 1m10s user time, but 5m30s real time because of all the pausing for disk writes (ext4, but as above it doesn't seem to make much difference compared to xfs)... The same program & latest git on an SSD-backed system (ie the massive number of small write transactions doesn't cause any issues) with a slightly faster CPU - user time 47sec, real time 1min. On the SSD-backed box without any commits - 5m30s user time, 6min real time.
So committing every 1-2m records is much better. I don't mind using short transactions (in fact the program doesn't actually need any transactions). Perhaps it would be good to have an "allow LMDB to automatically commit+reopen this transaction for optimal performance" flag, or some way of easily knowing when the txn should be committed and reopened, rather than trying to guess roughly how many bytes I've written since the last txn and committing if > a magic number of 400mb?
Also, I don't know how intentional the 512mb limit you mention is, but perhaps that could be set at runtime - that way I could just set it to half the box's memory size and ensure I don't need to write anything until I have the whole thing generated?
By the way, looking at `free` output seems to imply that `top` is lying about how much memory the program is using - resident looks like it is capped at 500mb but it keeps rising along with shared, which is presumably the pages in the mmap that are in memory at the moment.
Yes, the shared memory is included in the rss; it's quite deceptive, especially if you have multiple processes using shared memory.
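(On Linux, /proc/<pid>/smaps gives a truer breakdown: the LMDB data file shows up as its own mapping, with separate Rss and Shared_Clean/Shared_Dirty counters.)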
wrt the ssd vs hdd performance differences, I did see similar disk write issues with kyoto. So for that we generate onto a memdisk; however, it seems a bit strange to have to do this with LMDB given it's advertised as a memory database.
LMDB is *not* advertised as a "memory database" - it is advertised as a memory-mapped disk database. It is only people who have no clue what they're talking about who refer to it as a "memory database". Memory databases have no persistence and are limited to the size of RAM. LMDB has neither of those traits. Being a disk-based DB means we're affected by issues like disk seek time.
- Creating a database with non-sequential keys is very bad (on 4gb
databases, 2* slower than kyoto - about 1h30 and uses more memory).
This was actually a typo - kyoto only takes about 20 minutes to generate it, so 4* slower. However, using a commit every 1m inserts (and because of a limitation in the perl module we also have to close the DB/env and reopen it) and backing it onto a memdisk (which we also have to do with kyoto), it takes about 10% less time than kyoto. Doing it against a normal disk is still very slow though. Size for a 4gb database was about 10% more than kyoto; in one case where kyoto output a 1.5gb database, lmdb produced 2.5gb though. Doesn't matter too much for our purposes however.
Mark
Mark Zealey wrote:
- Creating a database with non-sequential keys is very bad (on 4gb
databases, 2* slower than kyoto - about 1h30 and uses more memory).
This was actually a typo - kyoto only takes about 20 minutes to generate it, so 4* slower. However, using a commit every 1m inserts (and because
Try again with commit every 100K inserts.
of a limitation in the perl module we also have to close the DB/env and reopen it) and backing it onto a memdisk (which we also have to do with kyoto), it takes about 10% less time than kyoto. Doing it against a normal disk is still very slow though. Size for a 4gb database was about 10% more than kyoto; in one case where kyoto output a 1.5gb database, lmdb produced 2.5gb though. Doesn't matter too much for our purposes however.
Look at mdb_stat -ef on the resulting DB, you'll see that a large amount of pages claimed on disk are actually free pages in the DB. Larger commits leave more old pages behind than smaller commits.
On 23/08/13 14:02, Howard Chu wrote:
Mark Zealey wrote:
- Creating a database with non-sequential keys is very bad (on 4gb
databases, 2* slower than kyoto - about 1h30 and uses more memory).
This was actually a typo - kyoto only takes about 20 minutes to generate it so 4* slower. However, using a commit every 1m inserts (and because
Try again with commit every 100K inserts.
Thanks, that's much quicker again - for a commit every 30k or 100k inserts the time is about 12m30s. This is probably more limited by us pulling the data now. Of course, this is still using memdisk.
I've found another weird thing - I have now converted the database to use duplicates. Typically when I do mdb_cursor_get(... MDB_NEXT) it will set the key and value, but I've found 1 place so far where I do it and on the duplicate's second entry the value is set but the key is empty. However, in other cases with duplicates both key and value are returned. Also, using MDB_NEXT_DUP seems to return key/value in all cases. Could this be another feature caused by a page boundary? (I'm assuming here that MDB_NEXT is meant to go through dups as well, although the docs don't specifically say that.)
Mark
Mark Zealey wrote:
On 23/08/13 14:02, Howard Chu wrote:
Mark Zealey wrote:
- Creating a database with non-sequential keys is very bad (on 4gb
databases, 2* slower than kyoto - about 1h30 and uses more memory).
This was actually a typo - kyoto only takes about 20 minutes to generate it so 4* slower. However, using a commit every 1m inserts (and because
Try again with commit every 100K inserts.
Thanks, that's much quicker again - for a commit every 30k or 100k inserts the time is about 12m30s. This is probably more limited by us pulling the data now. Of course, this is still using memdisk.
I've found another weird thing - I have now converted the database to use duplicates. Typically when I do mdb_cursor_get(... MDB_NEXT) it will set the key and value, but I've found 1 place so far where I do it and on the duplicate's second entry the value is set but the key is empty.
I don't see how this can happen; the only time we don't return the key is if some operation actually failed. Can you send test code to reproduce this?
However, in other cases with duplicates both key and value are returned. Also, using MDB_NEXT_DUP seems to return key/value in all cases. Could this be another feature caused by a page boundary? (I'm assuming here that MDB_NEXT is meant to go through dups as well, although the docs don't specifically say that.)
Next means next. The docs shouldn't have to say anything more specific than that.
I've found another weird thing - I have now converted the database to use duplicates. Typically when I do mdb_cursor_get(... MDB_NEXT) it will set the key and value, but I've found 1 place so far where I do it and on the duplicate's second entry the value is set but the key is empty.
I don't see how this can happen; the only time we don't return the key is if some operation actually failed. Can you send test code to reproduce this?
Attached .c shows it - create 3 keys with 5 entries under each. Actually my report was incorrect - cursor_get() with MDB_NEXT or MDB_NEXT_DUP never seems to set the key unless it is the first entry read... Perhaps this is intended?!
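A minimal sketch of such a test - a hypothetical reconstruction, not the original attachment; the key/value strings and the usual env/txn setup are assumed:

    /* dbi opened with MDB_DUPSORT; insert 3 keys x 5 dups each, then
       walk with MDB_NEXT printing what comes back */
    rc = mdb_open(txn, NULL, MDB_DUPSORT, &dbi);
    for (i = 0; i < 3; i++)
        for (j = 0; j < 5; j++) {
            char kbuf[16], vbuf[16];
            sprintf(kbuf, "key%d", i);
            sprintf(vbuf, "val%d", j);
            key.mv_size = strlen(kbuf); key.mv_data = kbuf;
            data.mv_size = strlen(vbuf); data.mv_data = vbuf;
            rc = mdb_put(txn, dbi, &key, &data, 0);
        }
    rc = mdb_cursor_open(txn, dbi, &cursor);
    while (mdb_cursor_get(cursor, &key, &data, MDB_NEXT) == 0)
        printf("key='%.*s' val='%.*s'\n",
               (int)key.mv_size, (char *)key.mv_data,
               (int)data.mv_size, (char *)data.mv_data);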
Mark
Mark Zealey wrote:
I've found another weird thing - I have now converted the database to use duplicates. Typically when I do mdb_cursor_get(... MDB_NEXT) it will set the key and value, but I've found 1 place so far where I do it and on the duplicate's second entry the value is set but the key is empty.
I don't see how this can happen; the only time we don't return the key is if some operation actually failed. Can you send test code to reproduce this?
Attached .c shows it - create 3 keys with 5 entries under each. Actually my report was incorrect - cursor_get() with MDB_NEXT or MDB_NEXT_DUP never seems to set the key unless it is the first entry read... Perhaps this is intended?!
Yes and no. It was intended for NEXT_DUP because, since it's a duplicate, you already know what the key is. It is unintended for NEXT, for the opposite reason, and in this case it's a bug.
Mark
On 23/08/13 17:08, Howard Chu wrote:
Mark Zealey wrote:
I've found another weird thing - I have now converted the database to use duplicates. Typically when I do mdb_cursor_get(... MDB_NEXT) it will set the key and value, but I've found 1 place so far where I do it and on the duplicate's second entry the value is set but the key is empty.
I don't see how this can happen; the only time we don't return the key is if some operation actually failed. Can you send test code to reproduce this?
Attached .c shows it - create 3 keys with 5 entries under each. Actually my report was incorrect - cursor_get() with MDB_NEXT or MDB_NEXT_DUP never seems to set the key unless it is the first entry read... Perhaps this is intended?!
Yes and no. It was intended for NEXT_DUP because, since it's a duplicate, you already know what the key is. It is unintended for NEXT, for the opposite reason, and in this case it's a bug.
It would be nice to have it for NEXT_DUP as well, to be honest - I have a function that gets called for each record and it would be good not to have to save state between calls.
Mark
Mark Zealey wrote:
On 23/08/13 17:08, Howard Chu wrote:
Mark Zealey wrote:
I've found another weird thing - I have now converted the database to use duplicates. Typically when I do mdb_cursor_get(... MDB_NEXT) it will set the key and value, but I've found 1 place so far where I do it and on the duplicate's second entry the value is set but the key is empty.
I don't see how this can happen; the only time we don't return the key is if some operation actually failed. Can you send test code to reproduce this?
Attached .c shows it - create 3 keys with 5 entries under each. Actually my report was incorrect - cursor_get() with MDB_NEXT or MDB_NEXT_DUP never seems to set the key unless it is the first entry read... Perhaps this is intended?!
Yes and no. It was intended for NEXT_DUP because, since it's a duplicate, you already know what the key is. It is unintended for NEXT, for the opposite reason, and in this case it's a bug.
It would be nice to have it for NEXT_DUP as well, to be honest - I have a function that gets called for each record and it would be good not to have to save state between calls.
See latest mdb.master.
On 23/08/13 18:54, Howard Chu wrote:
Mark Zealey wrote:
On 23/08/13 17:08, Howard Chu wrote:
Mark Zealey wrote:
I've found another weird thing - I have now converted the database to use duplicates. Typically when I do mdb_cursor_get(... MDB_NEXT) it will set the key and value, but I've found 1 place so far where I do it and on the duplicate's second entry the value is set but the key is empty.
I don't see how this can happen; the only time we don't return the key is if some operation actually failed. Can you send test code to reproduce this?
Attached .c shows it - create 3 keys with 5 entries under each. Actually my report was incorrect - cursor_get() with MDB_NEXT or MDB_NEXT_DUP never seems to set the key unless it is the first entry read... Perhaps this is intended?!
Yes and no. It was intended for NEXT_DUP because, since it's a duplicate, you already know what the key is. It is unintended for NEXT, for the opposite reason, and in this case it's a bug.
It would be nice to have it for NEXT_DUP as well, to be honest - I have a function that gets called for each record and it would be good not to have to save state between calls.
See latest mdb.master.
Looks great, thanks for the quick fix!
Mark