Hi all,
First, I'm having trouble finding resources to answer a question like this myself, so please forgive me if I've missed something.
I'm considering using LMDB (versus LevelDB) for a project I'm working on where I'll be receiving a high volume (hundreds per second) of high priority requests (over HTTP) and issuing multiple (<10) database queries per request.
I'll also have a separate process receiving updates for the data and writing to the database. This will happen often (several times a minute, perhaps), but the priority is much lower than the read requests.
LMDB appealed to me because of the read performance and because I could have one process reading data from LMDB and another process writing data updates to LMDB.
For proof of concept, I hacked up the following (I'll use pseudocode since I used the Go bindings for my actual programs, and hopefully my question is sufficiently abstract not to matter):
Process 1, the writer, simply writes a random integer (from 0 to 1000) to a defined set of keys:
env = NewEnv()
env.Open("/tmp/foo", 0, 0664)
txn = env.BeginTxn(nil, 0)
dbi = txn.DBIOpen(nil, 0)
txn.Commit()
txn = env.BeginTxn(nil, 0)
n_entries = 5
for i = 0; i < n_entries; i++ {
    key = sprintf("Key-%d", i)
    val = sprintf("Val-%d", rand.Int(1000))
    txn.Put(dbi, key, val, 0)
}
txn.Commit()
env.DBIClose(dbi)
env.Close()
Process 2, the reader, simply loops forever and does random access reads on the data from process 1 (I won't benefit from a cursor for my actual problem), and prints out that data occasionally:
env = NewEnv()
env.Open("/tmp/foo", 0, 0664)
while {
    txn = env.BeginTxn(nil, 0)
    dbi = txn.DBIOpen(nil, 0)
    txn.Commit()
    for i = 0; i < n_entries; i++ {
        key = sprintf("Key-%d", i)
        val = txn.Get(dbi, key)
        print("%s: %s", key, val)
    }
    env.DBIClose(dbi)
    sleep(5)
}
So my high level question is: What am I doing wrong? This seems to work OK, but a lot of it was guesswork, so I'm sure I'm doing some silly things.
For example, at first I put the BeginTxn() and DBIOpen() calls in process 2 outside of the while loop, but when I did that, I never saw the updated values when running process 1 simultaneously. In my real-world application, it seems like adding these calls to every request (to be sure the data being read is up to date) could be an unnecessary performance penalty.
I suspect there are flags that I can/should be using, but I'm not sure.
Thanks for any input.
Brian
Brian G. Merrell wrote:
Hi all,
First, I'm having trouble finding resources to answer a question like this myself, so please forgive me if I've missed something.

http://symas.com/mdb/doc/

[...]
So my high level question is: What am I doing wrong? This seems to work OK, but a lot of it was guesswork, so I'm sure I'm doing some silly things.
Your reader process should be using read transactions.
For example, at first I put the BeginTxn() and DBIOpen() calls in process 2 outside of the while loop, but when I did that, I never saw the updated values when running process 1 simultaneously. In my real-world application, it seems like adding these calls to every request (to be sure the data being read is up to date) could be an unnecessary performance penalty.
In the actual LMDB API read transactions can be reused by their creating thread, so they are zero-cost after the first time. I don't know if any of the other language wrappers leverage this fact.
Opening a DBI only needs to be done once per process. Opening per transaction would be stupid, like reopening a file handle on every request.
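In your pseudocode, that just means doing the open once at startup and keeping the handle around (same pseudocode style as above):

// once, at process startup
txn = env.BeginTxn(nil, 0)
dbi = txn.DBIOpen(nil, 0)
txn.Commit()

// dbi stays valid for every later txn; close it only at process teardown
env.DBIClose(dbi)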
On Wed, Jun 4, 2014 at 10:22 AM, Howard Chu <hyc@symas.com> wrote:
Brian G. Merrell wrote:
Hi all,
First, I'm having trouble finding resources to answer a question like this myself, so please forgive me if I've missed something.

http://symas.com/mdb/doc/
Thanks. I did see and skim the API portion of the docs before asking, but I was just having trouble knowing how the pieces fit together to solve a problem.
[...]
Your reader process should be using read transactions.
OK, I interpret this as meaning that I need to pass the MDB_RDONLY flag to mdb_txn_begin. Is that correct?
In the actual LMDB API read transactions can be reused by their creating thread, so they are zero-cost after the first time. I don't know if any of the other language wrappers leverage this fact.
This helps a lot. I will investigate whether gomdb takes advantage of this.
Opening a DBI only needs to be done once per process. Opening per transaction would be stupid, like reopening a file handle on every request.
I suspected so. The fact that mdb_dbi_open takes a transaction had me confused a bit, because I thought I would need to pass in the new transaction every time I got a transaction from mdb_txn_begin.
I've refactored the reader to look like this:
env = NewEnv()
env.Open("/tmp/foo", 0, 0664)
txn = env.BeginTxn(nil, mdb.RDONLY)  // parent txn is the nil arg
dbi = txn.DBIOpen(nil, 0)
txn.Abort()
while {
    txn = env.BeginTxn(nil, mdb.RDONLY)  // parent txn is the nil arg
    for i = 0; i < n_entries; i++ {
        key = sprintf("Key-%d", i)
        val = txn.Get(dbi, key)
        print("%s: %s", key, val)
    }
    txn.Commit()
    sleep(5)
}
env.DBIClose(dbi)
Now, I guess the big question is whether BeginTxn inside the loop is zero-cost.
Thanks for the tips so far, Howard; they have been very helpful.
--
Howard Chu
CTO, Symas Corp.               http://www.symas.com
Director, Highland Sun         http://highlandsun.com/hyc/
Chief Architect, OpenLDAP      http://www.openldap.org/project/
Brian
Brian G. Merrell wrote:
On Wed, Jun 4, 2014 at 10:22 AM, Howard Chu <hyc@symas.com> wrote:
Brian G. Merrell wrote:
Hi all,
First, I'm having trouble finding resources to answer a question like this myself, so please forgive me if I've missed something.

http://symas.com/mdb/doc/
Thanks. I did see and skim the API portion of the docs before asking, but I was just having trouble knowing how the pieces fit together to solve a problem.
Skimming isn't going to cut it.
Your reader process should be using read transactions.
OK, I interpret this as meaning that I need to pass the MDB_RDONLY flag to mdb_txn_begin. Is that correct?
Yes.
In the actual LMDB API read transactions can be reused by their creating thread, so they are zero-cost after the first time. I don't know if any of the other language wrappers leverage this fact.
This helps a lot. I will investigate what the case is with gomdb.
Opening a DBI only needs to be done once per process. Opening per transaction would be stupid, like reopening a file handle on every request.
I suspected so. The fact that mdb_dbi_open takes a transaction had me confused a bit, because I thought I would need to pass in the new transaction every time I got a transaction from mdb_txn_begin.
mdb_dbi_open takes a txn because it needs one if you're creating a DB for the first time. I.e., it must write metadata for the DB into the environment, and all writes to MDB must be inside a txn. But once that txn is committed, the DBI itself lives on until mdb_dbi_close. This is all already explained in the doc for mdb_dbi_open; if you hadn't skimmed you would have seen it already.
Most of this is only a concern when you're using named subDBs. The default unnamed DB always exists, so its DBI is always valid anyway.
I've refactored the reader to look like this:
env = NewEnv()
env.Open("/tmp/foo", 0, 0664)
txn = env.BeginTxn(nil, mdb.RDONLY)  // parent txn is the nil arg
dbi = txn.DBIOpen(nil, 0)
txn.Abort()
You want mdb_txn_reset() here, not abort. Abort frees/destroys the txn handle so it cannot be reused.
while {
    txn = env.BeginTxn(nil, mdb.RDONLY)  // parent txn is the nil arg
and here you want mdb_txn_renew(), to reuse the txn handle instead of creating a new one.
    for i = 0; i < n_entries; i++ {
        key = sprintf("Key-%d", i)
        val = txn.Get(dbi, key)
        print("%s: %s", key, val)
    }
    txn.Commit()
and you want mdb_txn_reset() here too, not commit. Commit also frees/destroys the txn handle.
sleep(5)
}
You can abort or commit the txn during your process teardown phase to dispose of it.
env.DBIClose(dbi)
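Put together, the reader would look roughly like this (same pseudocode; Reset/Renew are assumed to map onto mdb_txn_reset/mdb_txn_renew in the binding):

env = NewEnv()
env.Open("/tmp/foo", 0, 0664)

txn = env.BeginTxn(nil, mdb.RDONLY)   // created once, reused forever
dbi = txn.DBIOpen(nil, 0)
txn.Reset()                           // release the snapshot, keep the handle

while {
    txn.Renew()                       // re-acquire a current snapshot; near zero cost
    for i = 0; i < n_entries; i++ {
        key = sprintf("Key-%d", i)
        val = txn.Get(dbi, key)
        print("%s: %s", key, val)
    }
    txn.Reset()                       // park the handle between uses
    sleep(5)
}

// teardown: only now destroy the handle
txn.Abort()
env.DBIClose(dbi)
env.Close()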
On Wed, Jun 4, 2014 at 1:04 PM, Howard Chu <hyc@symas.com> wrote:
Thanks. I did see and skim the API portion of the docs before asking, but I was just having trouble knowing how the pieces fit together to solve a problem.
Skimming isn't going to cut it.
Fair enough, I probably gave up prematurely. Blame my inferior intellect, but with zero other context on LMDB, I was having trouble getting a holistic view of it from the docs. The information you've shared, though, has made the docs much more approachable. For whatever it's worth, I plan to write something up with my findings that will hopefully help someone.
[...]
mdb_dbi_open takes a txn because it needs one if you're creating a DB for the first time. I.e., it must write metadata for the DB into the environment, and all writes to MDB must be inside a txn. But once that txn is committed, the DBI itself lives on until mdb_dbi_close. This is all already explained in the doc for mdb_dbi_open; if you hadn't skimmed you would have seen it already.
Most of this is only a concern when you're using named subDBs. The default unnamed DB always exists, so its DBI is always valid anyway.
I will probably use named subDBs for my real application (instead of 9 separate databases like I do in LevelDB), so thanks for sharing.
I've refactored the reader to look like this:
env = NewEnv()
env.Open("/tmp/foo", 0, 0664)
txn = env.BeginTxn(nil, mdb.RDONLY)  // parent txn is the nil arg
dbi = txn.DBIOpen(nil, 0)
txn.Abort()
You want mdb_txn_reset() here, not abort. Abort frees/destroys the txn handle so it cannot be reused.
while {
    txn = env.BeginTxn(nil, mdb.RDONLY)  // parent txn is the nil arg
and here you want mdb_txn_renew(), to reuse the txn handle instead of creating a new one.
Ahah! Thank you. I had tried this before, but because I had used the txn.Abort() above, things did not go well. Now my benchmark times are back to what I would expect. I.e., they are comparable to the performance I was seeing when I had all transaction code outside of the loop (but wasn't seeing the data being updated after running my writer process).
[...]
Thanks again Howard for the help early on. I'm into the development of my real application now, and I'm able to leverage the docs much better.
With my application, I have built a ~650MB database with 9 sub-databases. I'll have a writer application to do updates, which should be pretty easy. There are a couple of things I want to make sure I'm doing correctly for the reader:
My application is a web service written in Go, using Go's net/http package, which creates a new "goroutine" for each incoming request. goroutines run concurrently, but may be multiplexed onto a single OS thread. So, I will be using the MDB_NOTLS flag when opening the environment. Then--from what I can gather--it seems like I will need to allocate a pool of read-only transactions if I want to avoid allocating new transactions for each HTTP request (is that right?). Something like the following:
/* Test this to figure out how many are needed to never run out in practice */
N_READERS = 512
txn = env.BeginTxn(nil, MDB_RDONLY)   // mdb_txn_begin: parent=nil, flags=MDB_RDONLY
for each dbname in dbnames {
    txn.DBIOpen(dbname, 0)            // mdb_dbi_open: name=dbname, flags=0
}
txn.Commit()
for i = 0; i < N_READERS; i++ {
    txn = env.BeginTxn(nil, MDB_RDONLY)
    txnPool.Add(txn)
}
Then, for each HTTP request, I would pull a txn out of the pool, use it (for multiple sequential queries for a given HTTP request), reset it, renew it, and put it back in the pool.
I've got a proof of concept working with the above strategy, but does this all sound sane?
Thanks, Brian
P.S. Sorry about the previous non-plaintext e-mails sent to the list. Somehow my e-mail client reverted to silly mode.
On Wed, Jun 4, 2014 at 2:43 PM, Brian G. Merrell <bgmerrell@gmail.com> wrote:
[...]
Thanks again Howard for the help early on. I'm into the development of my real application now, and I'm able to leverage the docs much better.
With my application, I have built a ~650MB database with 9 sub-databases. I'll have a writer application to do updates, which should be pretty easy. There are a couple of things I want to make sure I'm doing correctly for the reader:
My application is a web service written in Go, using Go's net/http package, which creates a new "goroutine" for each incoming request. goroutines run concurrently, but may be multiplexed onto a single OS thread. So, I will be using the MDB_NOTLS flag when opening the environment. Then--from what I can gather--it seems like I will need to allocate a pool of read-only transactions if I want to avoid allocating new transactions for each HTTP request (is that right?). Something like the following:
/* Test this to figure out how many are needed to never run out in practice */
N_READERS = 512
txn = env.BeginTxn(nil, MDB_RDONLY)   // mdb_txn_begin: parent=nil, flags=MDB_RDONLY
for each dbname in dbnames {
    txn.DBIOpen(dbname, 0)            // mdb_dbi_open: name=dbname, flags=0
}
txn.Commit()
for i = 0; i < N_READERS; i++ {
    txn = env.BeginTxn(nil, MDB_RDONLY)
    txnPool.Add(txn)
}
Then, for each HTTP request, I would pull a txn out of the pool, use it (for multiple sequential queries for a given HTTP request), reset it, renew it, and put it back in the pool.
I've got a proof of concept working with the above strategy, but does this all sound sane?
Almost. Reset the reader txn when it's not being used. Renew it just before use. So in your case, always reset it before adding it to the pool.
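In the same pseudocode, with txnPool.Get/Put as assumed helpers, the corrected lifecycle would be:

// at startup: reset each txn before it first enters the pool
for i = 0; i < N_READERS; i++ {
    txn = env.BeginTxn(nil, MDB_RDONLY)
    txn.Reset()                // idle txns hold no snapshot
    txnPool.Put(txn)
}

// per HTTP request
handleRequest(req) {
    txn = txnPool.Get()
    txn.Renew()                // attach to a current snapshot
    // ... several txn.Get(dbi, key) calls for this request ...
    txn.Reset()                // release the snapshot before returning it
    txnPool.Put(txn)
}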
It is unfortunate that you're using a system like Go that multiplexes on top of OS threads. Your pool is going to require locks to manage access, and will be a bottleneck. In a conventional threaded program, thread-local storage can be accessed for free.
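For what it's worth, in Go the pool itself can be expressed as a buffered channel (a sketch, not a recommendation; the channel still synchronizes internally, so the contention point above stands):

txnPool = make(chan Txn, N_READERS)   // buffered channel as a simple pool
// checkout: txn = <-txnPool
// checkin:  txnPool <- txn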