Hi,

I have a database with about 1 million integer keys. Now I want to append a lot of small values (each 3 bytes) to random keys very often (tens of millions of times). How do I insert into the database the fastest?

Currently I am trying the following setup, but I wonder if that is optimal, because I see a decrease in indexing speed after about 500000 random writes (here appends):

- I have split the whole key range into 16 databases (each within a different lmdb environment), so each range-db holds about 65000 keys
- I open each range-db with the flags: MDB_APPENDDUP | MDB_INTEGERKEY | MDB_INTEGERDUP | MDB_DUPSORT
- then when I want to append 3 bytes to the value of a key, I use the following call: res = mdb_put( txn[dbnumber], db[dbnumber], &k, &v, MDB_APPENDDUP );

Thank you for your help!
-Tom
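(Roughly, the relevant part of my code looks like the sketch below. The path handling, map size and the key-to-range mapping are simplified placeholders here, not the exact production code, and transaction commits are omitted.)

    /* Sketch of the setup above.  Paths, map size and the key-to-range
     * mapping are placeholders; error handling is reduced to returning rc. */
    #include <stdint.h>
    #include "lmdb.h"

    #define NUM_RANGES 16

    static MDB_env *env[NUM_RANGES];
    static MDB_dbi  db[NUM_RANGES];
    static MDB_txn *txn[NUM_RANGES];

    static int open_range(int i, const char *path)
    {
        int rc;
        if ((rc = mdb_env_create(&env[i])) != MDB_SUCCESS) return rc;
        if ((rc = mdb_env_set_mapsize(env[i], (size_t)1 << 30)) != MDB_SUCCESS) return rc;
        if ((rc = mdb_env_open(env[i], path, 0, 0664)) != MDB_SUCCESS) return rc;
        if ((rc = mdb_txn_begin(env[i], NULL, 0, &txn[i])) != MDB_SUCCESS) return rc;
        /* INTEGERKEY/DUPSORT/INTEGERDUP are database-open flags;
         * MDB_APPENDDUP is a write-time flag and is passed to mdb_put below. */
        return mdb_dbi_open(txn[i], NULL,
                            MDB_INTEGERKEY | MDB_DUPSORT | MDB_INTEGERDUP,
                            &db[i]);
    }

    /* Append one 3-byte value to the duplicates of a (random) key. */
    static int append_value(uint32_t key, const unsigned char value[3])
    {
        int dbnumber = key % NUM_RANGES;      /* placeholder range mapping */
        MDB_val k = { sizeof(key), &key };
        MDB_val v = { 3, (void *)value };
        return mdb_put(txn[dbnumber], db[dbnumber], &k, &v, MDB_APPENDDUP);
    }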
Tom Kirchner wrote:
Hi,
I have a database with about 1 million integer keys. Now I want to append a lot of small values (each 3 bytes) to random keys very often (tens of millions of times). How do I insert into the database the fastest?
Currently I am trying the following setup, but I wonder if that is optimal, because I see a decrease in indexing speed after about 500000 random writes (here appends):
- I have split the whole key range into 16 databases (each within a different lmdb environment), so each range-db holds about 65000 keys
- I open each range-db with the flags: MDB_APPENDDUP | MDB_INTEGERKEY | MDB_INTEGERDUP | MDB_DUPSORT
- then when I want to append 3 bytes to the value of a key, I use the following call: res = mdb_put( txn[dbnumber], db[dbnumber], &k, &v, MDB_APPENDDUP );
You will get fastest throughput by doing as much sequential activity as possible. I.e., process all of the values for a single key before moving on to a different key.
As documented, INTEGERKEY/INTEGERDUP are meant for native integers of the host machine. There are no CPUs in existence that natively use 3-byte integers. You should just use 4 bytes; note that LMDB pads odd-sized records internally anyway, so using 3-byte values isn't going to save you any space.
If all your values are the same size you should also use DUPFIXED - that will save a significant amount of overhead.
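A rough sketch of how those suggestions could fit together, assuming the incoming records are buffered and sorted by key (and by value within a key) before writing. The record buffer, the sort step and all names below are illustrative assumptions, not part of the advice above:

    /* Sketch: bulk-load pre-sorted records with fixed-size 4-byte values.
     * struct rec, cmp_rec and bulk_load are illustrative names only. */
    #include <stdint.h>
    #include <stdlib.h>
    #include "lmdb.h"

    struct rec { uint32_t key; uint32_t val; };   /* 4-byte value instead of 3 */

    static int cmp_rec(const void *a, const void *b)
    {
        const struct rec *x = a, *y = b;
        if (x->key != y->key) return x->key < y->key ? -1 : 1;
        return (x->val > y->val) - (x->val < y->val);
    }

    static int bulk_load(MDB_env *env, struct rec *recs, size_t n)
    {
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_cursor *cur;
        int rc;

        /* Sort first so all work for one key is contiguous and duplicates
         * arrive in ascending order -- the condition MDB_APPENDDUP relies on. */
        qsort(recs, n, sizeof(recs[0]), cmp_rec);

        if ((rc = mdb_txn_begin(env, NULL, 0, &txn)) != MDB_SUCCESS) return rc;
        rc = mdb_dbi_open(txn, NULL,
                          MDB_INTEGERKEY | MDB_DUPSORT | MDB_INTEGERDUP | MDB_DUPFIXED,
                          &dbi);
        if (rc == MDB_SUCCESS)
            rc = mdb_cursor_open(txn, dbi, &cur);
        if (rc != MDB_SUCCESS) {
            mdb_txn_abort(txn);
            return rc;
        }

        for (size_t i = 0; i < n; i++) {
            MDB_val k = { sizeof(uint32_t), &recs[i].key };
            MDB_val v = { sizeof(uint32_t), &recs[i].val };
            if ((rc = mdb_cursor_put(cur, &k, &v, MDB_APPENDDUP)) != MDB_SUCCESS) {
                mdb_cursor_close(cur);
                mdb_txn_abort(txn);
                return rc;
            }
        }
        mdb_cursor_close(cur);
        return mdb_txn_commit(txn);
    }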
Tom Kirchner wrote:
- I have split the whole key range into 16 databases (each within a different lmdb environment), so each range-db holds about 65000 keys
Is it even worth it to split the keys into different databases? Is 1 million keys even a large number where one would expect a significant performance improvement from splitting the key range? If the B+tree depth (perfectly balanced) is about 4.8 for 65000 keys, it is 6 for 1 million keys - not much deeper (if I calculated correctly).
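(As a rough sanity check on those numbers: with a fanout of about f entries per interior page the depth is roughly log_f(N), so 4.8 vs. 6 corresponds to assuming about 10 entries per page: log10(65000) ~ 4.8 and log10(1000000) = 6. LMDB's actual fanout with small integer keys is much larger than 10, so the real depth difference should be smaller still.)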
Howard Chu wrote:
You will get fastest throughput by doing as much sequential activity as possible. I.e., process all of the values for a single key before moving on to a different key.
I will try to do that, but that's difficult in my case. I read that a duplicate key's values are appended to the same binary blob for that key. Is that correct? Is it maybe worth a try to index the data in batches: index 500000 values (using append to duplicate keys) into a temporary db and then sequentially write the values from the temporary db to the final db? A sketch of that second phase follows below.
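Something like this is what I have in mind for the merge step -- purely a sketch, assuming both the temporary and the final db live in one environment and use the same integer-key / sorted-dup layout; all names are illustrative:

    /* Sketch: drain a temporary staging db into the final db in key order. */
    #include "lmdb.h"

    static int merge_tmp_into_final(MDB_env *env, MDB_dbi tmp_dbi, MDB_dbi final_dbi)
    {
        MDB_txn *txn;
        MDB_cursor *cur;
        MDB_val k, v;
        int rc;

        if ((rc = mdb_txn_begin(env, NULL, 0, &txn)) != MDB_SUCCESS) return rc;
        if ((rc = mdb_cursor_open(txn, tmp_dbi, &cur)) != MDB_SUCCESS) {
            mdb_txn_abort(txn);
            return rc;
        }

        /* MDB_NEXT walks keys and their duplicates in sorted order, so the
         * writes into the final db happen key by key, value by value. */
        while ((rc = mdb_cursor_get(cur, &k, &v, MDB_NEXT)) == MDB_SUCCESS) {
            if ((rc = mdb_put(txn, final_dbi, &k, &v, 0)) != MDB_SUCCESS)
                break;
        }
        mdb_cursor_close(cur);
        if (rc != MDB_SUCCESS && rc != MDB_NOTFOUND) {  /* MDB_NOTFOUND = end of data */
            mdb_txn_abort(txn);
            return rc;
        }

        if ((rc = mdb_drop(txn, tmp_dbi, 0)) != MDB_SUCCESS) {  /* empty the staging db */
            mdb_txn_abort(txn);
            return rc;
        }
        return mdb_txn_commit(txn);
    }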
Howard Chu wrote:
If all your values are the same size you should also use DUPFIXED - that will save a significant amount of overhead.
Thanks for the hint, I must have overlooked it in the documentation. Will try that!
I read the documentation of lmdb at http://symas.com/mdb/doc/index.html, but sometimes I find it hard to get my head around the descriptions of the various flags one can use (e.g. for db open and put/get), especially MDB_INTEGERKEY and the like. Some additional explanations might be helpful, such as "How is the key size for fixed-size integer keys set?". Other than that I find the documentation very thorough. A glossary might also be helpful.
-Tom