I am new to key-value stores. I would like to use it like big persistent associative array and I want to be able to use long length keys. For example file path or URL. What technique should I use to implement such dictionary on top of LMDB? Probably some kind of hashing and collision resolution? Or just recompile with MDB_MAXKEYSIZE=2047 or something like this? Or LMDB is a wrong tool for this job?
Alex V. wrote:
I am new to key-value stores. I would like to use it like big persistent associative array and I want to be able to use long length keys. For example file path or URL. What technique should I use to implement such dictionary on top of LMDB? Probably some kind of hashing and collision resolution? Or just recompile with MDB_MAXKEYSIZE=2047 or something like this? Or LMDB is a wrong tool for this job?
Probably some combination of the above. Recompile with larger MAXKEYSIZE will make everything else easier.
Mozilla/Firefox/Seamonkey/whatever stores complete URLs in a history database, and uses the entire URL (including any URL parameters) as the key. This is an extremely wasteful design; first of all the URL params aren't meaningful when you just want to look back and see what sites you visited. Also, when storing keys that are inherently a hierarchical structure, you should store them hierarchically. You should look at the back-mdb code and the back-hdb design paper for how to do this efficiently.
http://www.openldap.org/conf/odd-sfo-2003/howard-dev.html
e.g., when storing a long pathname /some/very/long/path/name you should use a sequence of keys instead of a single key:
01some00 02very01 03long02 04path03 05name04
Each key is composed of its own unique ID, the path component, and the ID of its parent. (There's more to it, read the design paper.)
This saves space, particularly when you have a lot of items within a related hierarchy. e.g. /some/very/long/other/path /some/very/long/path/foo /some/very/long/other/file
The common components of each path only need to be saved once. It takes multiple DB lookups to retrieve the record for a given pathname, but individual lookups are extremely fast because the keys are short.
In SQLightning I used the hashing approach - save the first 56 bytes of a key as-is, for anything beyond that take a hash and append it to the first 56 bytes. This is reasonably fast but loses ordering properties of the keys. My next version will probably decompose keys into components just like the above, for space and speed reasons.
Howard Chu wrote:
Alex V. wrote:
I am new to key-value stores. I would like to use it like big persistent associative array and I want to be able to use long length keys. For example file path or URL. What technique should I use to implement such dictionary on top of LMDB? Probably some kind of hashing and collision resolution? Or just recompile with MDB_MAXKEYSIZE=2047 or something like this? Or LMDB is a wrong tool for this job?
Probably some combination of the above. Recompile with larger MAXKEYSIZE will make everything else easier.
Mozilla/Firefox/Seamonkey/whatever stores complete URLs in a history database, and uses the entire URL (including any URL parameters) as the key. This is an extremely wasteful design;
But for browsers that's the only possible way because they have to deal with everything which is out there.
first of all the URL params aren't meaningful when you just want to look back and see what sites you visited.
It very much depends.
Also, when storing keys that are inherently a hierarchical structure, you should store them hierarchically.
You can only do that if you can make strict assumptions about the URLs' structure. (Of course file pathnames are very simple subset of the problem.) It's not that simple with HTTP URLs in general given the many ways web apps, CMS whatever address data with URLs.
E.g. I saw a IAM tool fail to deal with the random looking URLs used to access Lotus Notes databases through Domino web server which actually can be somewhat hierarchical, but not necessarily the URL.
And one would also have to consider URL rewriting/redirecting in some cases.
So I hope for Alex that his data is much simpler. ;-)
Ciao, Michael.
Michael Ströder wrote:
Howard Chu wrote:
Alex V. wrote:
I am new to key-value stores. I would like to use it like big persistent associative array and I want to be able to use long length keys. For example file path or URL. What technique should I use to implement such dictionary on top of LMDB? Probably some kind of hashing and collision resolution? Or just recompile with MDB_MAXKEYSIZE=2047 or something like this? Or LMDB is a wrong tool for this job?
Probably some combination of the above. Recompile with larger MAXKEYSIZE will make everything else easier.
Mozilla/Firefox/Seamonkey/whatever stores complete URLs in a history database, and uses the entire URL (including any URL parameters) as the key. This is an extremely wasteful design;
But for browsers that's the only possible way because they have to deal with everything which is out there.
first of all the URL params aren't meaningful when you just want to look back and see what sites you visited.
It very much depends.
Also, when storing keys that are inherently a hierarchical structure, you should store them hierarchically.
You can only do that if you can make strict assumptions about the URLs' structure. (Of course file pathnames are very simple subset of the problem.) It's not that simple with HTTP URLs in general given the many ways web apps, CMS whatever address data with URLs.
Nonsense. The domain name component of an HTTP URL is purely hierarchical, and the pathname component of an HTTP URL is also. Anything after the pathname may be considered opaque, for simplicity, but the fact is URL syntax is already well defined and the components are clear-cut.
Howard Chu wrote:
Michael Ströder wrote:
Howard Chu wrote:
Alex V. wrote:
I am new to key-value stores. I would like to use it like big persistent associative array and I want to be able to use long length keys. For example file path or URL. What technique should I use to implement such dictionary on top of LMDB? Probably some kind of hashing and collision resolution? Or just recompile with MDB_MAXKEYSIZE=2047 or something like this? Or LMDB is a wrong tool for this job?
Probably some combination of the above. Recompile with larger MAXKEYSIZE will make everything else easier.
Mozilla/Firefox/Seamonkey/whatever stores complete URLs in a history database, and uses the entire URL (including any URL parameters) as the key. This is an extremely wasteful design;
But for browsers that's the only possible way because they have to deal with everything which is out there.
first of all the URL params aren't meaningful when you just want to look back and see what sites you visited.
It very much depends.
Also, when storing keys that are inherently a hierarchical structure, you should store them hierarchically.
You can only do that if you can make strict assumptions about the URLs' structure. (Of course file pathnames are very simple subset of the problem.) It's not that simple with HTTP URLs in general given the many ways web apps, CMS whatever address data with URLs.
Nonsense.
Not nonsense, rather much experience with web systems.
The domain name component of an HTTP URL is purely hierarchical,
Yes, the hostname (FQDN) is hierarchical.
and the pathname component of an HTTP URL is also.
Maybe in some of your deployments but not in general. For example some web apps have the random document or even session IDs as part of path name without having a hierachical structure.
Anything after the pathname may be considered opaque, for simplicity, but the fact is URL syntax is already well defined and the components are clear-cut.
The URL syntax is well-defined but not the semantics. Note that URLs are not limited to serve only static content.
Ciao, Michael.
Michael Ströder wrote:
Howard Chu wrote:
and the pathname component of an HTTP URL is also.
Maybe in some of your deployments but not in general. For example some web apps have the random document or even session IDs as part of path name without having a hierachical structure.
The fact is if it's in the URL pathname then it's delimited by slashes and you can break it apart at those delimiters. Whether they actually reflect any hierarchical mapping or not is irrelevant, the point is you don't need to store the entire URL+parameters as a single key.
On Wed, Apr 22, 2015 at 3:17 AM, Alex V. alex.v.odesk@gmail.com wrote:
For example file path or URL.
We in KDE[1] have been using LMDB for building a file indexer. In order to store file paths we have also used a sequence of keys since file paths are hierarchical.
* Every file has an id. This is a combination of the device id + inode number. * We map fileId -> (parentId fileName) * We map parentId -> list of sub-file ids
For (id -> file path), multiple lookups are required For (file Path -> id), we can just rely on the file system
This obviously only works for files which exist in the file system. Also with this folder renames are very cheap.
openldap-technical@openldap.org