h.b.furuseth@usit.uio.no wrote:
I wrote:
OTOH if you add a bunch of slightly smaller nodes, mdb will put most of them in separate pages anyway without MDB_APPEND.
...because mdb_page_split() has been wasteful since 48ef27b6f5c804eca6a9 "ITS#7385 fix mdb_page_split (again)". When a txn put()s ascending keys with nodes of the same size, the new item goes in the fullest page.
E.g. put data size 1010 with 'int' keys 1,2,3... to an MDB_INTEGERKEY DB. As the txn progresses, (page: #key #key...) get distributed thus:
Page 2: #1.
Page 2: #1 #2.
Page 2: #1 #2 #3.
Page 2: #1. Page 3: #2 #3 #4.
Page 2: #1. Page 3: #2. Page 5: #3 #4 #5.
Page 2: #1. Page 3: #2. Page 5: #3. Page 6: #4 #5 #6.
2/3 wasted space. Descending put() works better:
Page 2: #6.
Page 2: #5 #6.
Page 2: #4 #5 #6.
Page 2: #3 #4. Page 3: #5 #6.
Page 2: #2 #3 #4. Page 3: #5 #6.
Page 2: #2 #1. Page 3: #5 #6. Page 5: #3 #4.
Ascending put() with datasize 1348, so only 2 nodes fit in a page:
Page 2: #1.
Page 2: #1 #2.
Page 2: #1. Page 3: #2 #3.
Page 2: #1. Page 3: #2. Page 5: #3 #4.
Half of the space is wasted. Descending order does not help.
page_split() cannot know which split is best in this case. But it'll help to guess that the next put() key sometimes will be near this one, and ensure that the node with the new key will be the smallest. That will also avoid touching the old page when the nodes are that large, since the "split" will keep all old nodes in the old page.
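To make the difference concrete, here is a toy model of the ascending case only. It is a sketch, not the mdb_page_split() logic: the function name put_ascending, the capacity parameter (number of equal-sized nodes that fit in one page) and the new_key_alone flag are all made up for illustration, and page numbers are ignored.

```python
def put_ascending(nkeys, capacity, new_key_alone):
    """Toy model: leaf pages receiving keys 1..nkeys in ascending order.

    capacity      -- hypothetical: how many equal-sized nodes fit in a page
    new_key_alone -- False: on a split, old nodes move so that the page
                     holding the new key ends up full (the wasteful
                     pattern in the ascending traces above)
                     True: on a split, all old nodes stay in the old page
                     and the new key starts a fresh page (the guess
                     proposed above)
    Returns the leaf pages, in key order, as lists of keys.
    """
    pages = [[]]
    for key in range(1, nkeys + 1):
        last = pages[-1]
        if len(last) < capacity:
            last.append(key)            # room left, no split needed
        elif new_key_alone:
            pages.append([key])         # old page is not touched
        else:
            pages[-1] = last[:1]        # old page keeps only its lowest node
            pages.append(last[1:] + [key])
    return pages


if __name__ == "__main__":
    for capacity, nkeys in ((3, 6), (2, 4)):
        for new_key_alone in (False, True):
            pages = put_ascending(nkeys, capacity, new_key_alone)
            print(capacity, new_key_alone, len(pages), pages)
```

With 3 nodes per page and 6 keys, the first policy ends with four pages holding #1 / #2 / #3 / #4 #5 #6, as in the ascending trace, while the second ends with two full pages. The model says nothing about descending or random inserts, where the guess may not pay off.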
Fixed now in mdb.master. On a large DB, the previous code used 3276587 pages in slapadd -q, and the new code uses 3272633 pages. That is a savings of only about 0.12% in this case; it seems these insert patterns are quite rare. The runtime is also 1.1% faster, going from
real 41m35.8s user 50m57.6s sys 5m11.4s
to
real 41m1.7s user 50m25.8s sys 4m55.3s