Hi,
I wanted to share some thoughts and findings, and perhaps collect some comments, about a small test I did: comparing LMDB and SQLite3 for indexing a file tree.
In this scenario, we're capturing file paths along with some metadata (size, mtime, and hashes; let's use only sha1 for this example), and we want to be able to walk the tree efficiently, as well as resolve a file's info by id or by path.
I'm showing the SQL schema first, as it helps formalize the "NoSQL" case:
CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    sha1 BLOB,
    size INTEGER,
    mtime INTEGER
);

CREATE TABLE IF NOT EXISTS pathsegments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    parent_id INTEGER,
    file_id INTEGER,
    segment_name TEXT,
    FOREIGN KEY(parent_id) REFERENCES pathsegments(id),
    FOREIGN KEY(file_id) REFERENCES files(id)
);

CREATE INDEX IF NOT EXISTS idx_sha1 ON files (sha1);
CREATE INDEX IF NOT EXISTS idx_fileid2segment ON pathsegments (file_id);
CREATE INDEX IF NOT EXISTS idx_parent_id2segment ON pathsegments (parent_id);
Capturing the full path of each entry as a string was not something I wanted, and it was also wasteful in the few datasets I tested it on. So we store path segments, and a full path can be reconstructed: when walking, from the root down to the leaves; or, given a file ID, from the leaf back up to the root, using the fileid2segment index and the parent information. And if files get moved around, we can alter the database without rewriting many paths.
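To make the leaf-to-root case concrete, here is a minimal sketch against the schema above (assuming a sqlite3 connection named db, and taking a NULL parent_id to mark a root segment):

def path_of_file(db, file_id):
    # idx_fileid2segment: find the leaf segment owning this file.
    row = db.execute(
        "SELECT parent_id, segment_name FROM pathsegments"
        " WHERE file_id = ?", (file_id,)).fetchone()
    parts = []
    while row is not None:
        parent_id, name = row
        parts.append(name)
        # Follow the parent chain toward the root; a NULL parent_id
        # matches no row, which terminates the loop.
        row = db.execute(
            "SELECT parent_id, segment_name FROM pathsegments"
            " WHERE id = ?", (parent_id,)).fetchone()
    return "/".join(reversed(parts))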
With lmdb, as I was conscious that each database would "intrinsically" build an index, I packed file_id->{mtime,size,hash} as a fid2attrs database and segment_id->{parent_id,name} as sid2dirname, and we still have a segment_id->kids DUPSORT database, sid2kids, for doing an asciibetical readdir like the SQL idx_parent_id2segment, plus hash2fid for the hash lookups.
So we have our 2 tables with 2+3 indexes in SQL, and 4 DBs in lmdb (fid2attrs, sid2kids, sid2dirname, hash2fid).
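A minimal sketch of that layout with the py-lmdb binding (names as above; the map_size is an arbitrary placeholder):

import lmdb

env = lmdb.open("tree.lmdb", max_dbs=4, map_size=1 << 30)
fid2attrs = env.open_db(b"fid2attrs")      # file_id -> {mtime,size,hash}
sid2dirname = env.open_db(b"sid2dirname")  # segment_id -> {parent_id,name}
sid2kids = env.open_db(b"sid2kids", dupsort=True)  # segment_id -> kids, sorted
hash2fid = env.open_db(b"hash2fid")        # sha1 -> file_id

# The DUPSORT cursor then gives the asciibetical readdir directly:
# with env.begin() as txn:
#     cur = txn.cursor(db=sid2kids)
#     if cur.set_key(some_sid_key):  # some_sid_key: an encoded segment_id
#         for kid in cur.iternext_dup():
#             ...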
I implemented this (in Python) with sqlite3 and lmdb (using COBS for serialization, for simplicity and because it has the nice property that serialized varint unsigned integers keep their ordering), and went on to build DBs from a dataset of 474782 small files and 45014 directories (which I've coincidentally uploaded at https://archive.org/download/ftp.modland.com_pub_modules).
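The ordering property is what matters here: lmdb compares keys as plain byte strings, so encoded integers must sort the same way as their numeric values. As a stand-in sketch (not the COBS/varint encoding itself), fixed-width big-endian packing illustrates the same property:

import struct

# Stand-in illustration, not the COBS/varint encoding: fixed-width
# big-endian integers also compare byte-wise in numeric order, which
# is what LMDB's lexicographic key ordering needs.
def encode_id(n):
    return struct.pack(">Q", n)

assert encode_id(9) < encode_id(10) < encode_id(256)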
Well, I was surprised: the SQLite3 file weighed 67,149,824 bytes, while the lmdb one was larger at 82,718,720 bytes.
In particular, looking at the database statistics, it seems that the hash-to-id index/DB had an unexpected (to me) overhead of ~112% relative to payload size with lmdb (even though it was storing less payload): 24932352 bytes for lmdb's DB vs. 15298560 bytes for SQLite's index.
SQLite statistics:
Percentage of total database...................... 22.8%
Number of entries................................. 474782
Bytes of storage consumed......................... 15298560
Bytes of payload.................................. 12311437 80.5%
Bytes of metadata................................. 1469162 9.6%
B-tree depth...................................... 3
Average payload per entry......................... 25.93
Average unused bytes per entry.................... 3.20
Average metadata per entry........................ 3.09
Average fanout.................................... 109.00
Non-sequential pages.............................. 2986 80.0%
Maximum payload per entry......................... 26
Entries that use overflow......................... 0 0.0%
Index pages used.................................. 34
Primary pages used................................ 3701
Overflow pages used............................... 0
Total pages used.................................. 3735
Unused bytes on index pages....................... 16998 12.2%
Unused bytes on primary pages..................... 1500963 9.9%
Unused bytes on overflow pages.................... 0
Unused bytes on all pages......................... 1517961 9.9%
LMDB statistics:
depth=3 branch_pages=66 leaf_pages=6021 overflow_pages=0
Keys size: 9495640
Values size: 2261081
Payload size: 11756721
Pages size: 24932352
Average key size: 20.000 (20-20)
Average value size: 4.762 (1-5)
Average entry size: 24.762
Entries number: 474782
Overhead: 13175631
Overhead/payload ratio: 1.121
Overhead/entry avg: 27.751
I checked the overhead for in-order insertions and it's still around 50% for these key/value sizes, much more than what SQLite does (~21%) for random insertions.
But even if we were to remove this index, SQLite's file would still be tighter...
That's it!
Best regards,
Jérôme Carretero wrote:
> Hi,
> I wanted to share some thoughts and findings, and perhaps collect some comments, about a small test I did: comparing LMDB and SQLite3 for indexing a file tree.
> In this scenario, we're capturing file paths along with some metadata (size, mtime, and hashes; let's use only sha1 for this example), and we want to be able to walk the tree efficiently, as well as resolve a file's info by id or by path.
> I'm showing the SQL schema first, as it helps formalize the "NoSQL" case:
> CREATE TABLE IF NOT EXISTS files (
>     id INTEGER PRIMARY KEY AUTOINCREMENT,
>     sha1 BLOB,
>     size INTEGER,
>     mtime INTEGER
> );
>
> CREATE TABLE IF NOT EXISTS pathsegments (
>     id INTEGER PRIMARY KEY AUTOINCREMENT,
>     parent_id INTEGER,
>     file_id INTEGER,
>     segment_name TEXT,
>     FOREIGN KEY(parent_id) REFERENCES pathsegments(id),
>     FOREIGN KEY(file_id) REFERENCES files(id)
> );
>
> CREATE INDEX IF NOT EXISTS idx_sha1 ON files (sha1);
> CREATE INDEX IF NOT EXISTS idx_fileid2segment ON pathsegments (file_id);
> CREATE INDEX IF NOT EXISTS idx_parent_id2segment ON pathsegments (parent_id);
> Capturing the full path of each entry as a string was not something I wanted, and it was also wasteful in the few datasets I tested it on. So we store path segments, and a full path can be reconstructed: when walking, from the root down to the leaves; or, given a file ID, from the leaf back up to the root, using the fileid2segment index and the parent information. And if files get moved around, we can alter the database without rewriting many paths.
Yes, that's the correct approach. This is how back-hdb, back-mdb, and back-ndb are structured as well.
> With lmdb, as I was conscious that each database would "intrinsically" build an index, I packed file_id->{mtime,size,hash} as a fid2attrs database and segment_id->{parent_id,name} as sid2dirname, and we still have a segment_id->kids DUPSORT database, sid2kids, for doing an asciibetical readdir like the SQL idx_parent_id2segment, plus hash2fid for the hash lookups.
> So we have our 2 tables with 2+3 indexes in SQL, and 4 DBs in lmdb (fid2attrs, sid2kids, sid2dirname, hash2fid).
In OpenLDAP, fileID and segmentID are the same thing. And sid2kids / sid2dirname are a single table. You can read about how that's done in the back-hdb paper https://openldap.org/conf/odd-sfo-2003/proceedings.html
> I implemented this (in Python) with sqlite3 and lmdb (using COBS for serialization, for simplicity and because it has the nice property that serialized varint unsigned integers keep their ordering), and went on to build DBs from a dataset of 474782 small files and 45014 directories (which I've coincidentally uploaded at https://archive.org/download/ftp.modland.com_pub_modules).
> Well, I was surprised: the SQLite3 file weighed 67,149,824 bytes, while the lmdb one was larger at 82,718,720 bytes.
That shouldn't be surprising, since SQLite3 is an update-in-place design and LMDB is copy-on-write.
> In particular, looking at the database statistics, it seems that the hash-to-id index/DB had an unexpected (to me) overhead of ~112% relative to payload size with lmdb (even though it was storing less payload): 24932352 bytes for lmdb's DB vs. 15298560 bytes for SQLite's index.
Any normal LMDB record has 8 byte overhead per record: 4 byte value length, 2 byte key length, and 2 byte flags. Most SQLite3 records have very little per-record overhead because they're fixed-width columns.
Since your fid2attrs and hash2id records are fixed size, you can use MDB_DUPFIXED to eliminate all of the per-record overhead of those tables. (Don't use varints for these, that would prevent using DUPFIXED and would completely defeat the point.)
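For scale, those per-record headers alone come to 474782 x 8 ~ 3.8 MB here. The flag is exposed in the py-lmdb binding as dupfixed (it requires dupsort); a minimal sketch of the suggested change, with a fixed-width 8-byte id in place of the varint, might look like:

import lmdb, struct

env = lmdb.open("tree.lmdb", max_dbs=4, map_size=1 << 30)
# MDB_DUPFIXED: all duplicate values under a key have the same size,
# so they can be packed without per-item node headers.
hash2fid = env.open_db(b"hash2fid", dupsort=True, dupfixed=True)

sha1 = bytes.fromhex("da39a3ee5e6b4b0d3255bfef95601890afd80709")  # example digest
with env.begin(write=True, db=hash2fid) as txn:
    txn.put(sha1, struct.pack(">Q", 1))  # fixed 8-byte big-endian file id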
> SQLite statistics:
> Percentage of total database...................... 22.8%
> Number of entries................................. 474782
> Bytes of storage consumed......................... 15298560
> Bytes of payload.................................. 12311437 80.5%
> Bytes of metadata................................. 1469162 9.6%
> B-tree depth...................................... 3
> Average payload per entry......................... 25.93
> Average unused bytes per entry.................... 3.20
> Average metadata per entry........................ 3.09
> Average fanout.................................... 109.00
> Non-sequential pages.............................. 2986 80.0%
> Maximum payload per entry......................... 26
> Entries that use overflow......................... 0 0.0%
> Index pages used.................................. 34
> Primary pages used................................ 3701
> Overflow pages used............................... 0
> Total pages used.................................. 3735
> Unused bytes on index pages....................... 16998 12.2%
> Unused bytes on primary pages..................... 1500963 9.9%
> Unused bytes on overflow pages.................... 0
> Unused bytes on all pages......................... 1517961 9.9%
Looks like SQLite3 uses a much higher page fill factor than textbook B+trees. Normally for a random-insert case, pages are split when they're full, leaving you two pages at about 50% utilization each. So on average a B+tree's pages would be 75% full, with only 25% unused bytes. This is what LMDB does for random inserts.
> LMDB statistics:
> depth=3 branch_pages=66 leaf_pages=6021 overflow_pages=0
> Keys size: 9495640
> Values size: 2261081
> Payload size: 11756721
> Pages size: 24932352
> Average key size: 20.000 (20-20)
> Average value size: 4.762 (1-5)
> Average entry size: 24.762
> Entries number: 474782
> Overhead: 13175631
> Overhead/payload ratio: 1.121
> Overhead/entry avg: 27.751
> I checked the overhead for in-order insertions and it's still around 50% for these key/value sizes, much more than what SQLite does (~21%) for random insertions.
> But even if we were to remove this index, SQLite's file would still be tighter...
> That's it!
> Best regards,