Hallvard B Furuseth wrote:
Howard Chu writes:
To avoid wasting those 5 unneeded bits in the base, we could use them as a run-length counter and use multiple vectors per base. But it might be better to slide things over and use just a 24 bit base, and use the bottom 8 bits of the word as a bitmask to represent which following vectors are present. E.g.
With variable-length records you can't do a binary search for an entry ID. I don't know how significant that is though. And you can get around it, at some expense:
It may be best to just use fixed size (maximal) vectors instead then. We can probably afford 256 bits of slop here and there. Then, absent any special markers, we only need a 24 bit base + 256 bit vector.
That allows us to represent up to 256 entryIDs in only 288 bits (instead of the 16384 bits we currently need).
So only 280 bits instead of 288...
Though in sparse IDLs, worst case will be 2 words per entry ID.
OTOH if we have exact ranges, best case is 2 words for any number of entries :-) I don't know how they are stored now, though. You mentioned loss of precision, so I assume approximate ranges are involved, but a glance at the code doesn't reveal where.
We don't use lists of ranges; that was one of my original proposals but Jong posted some arguments against that. (Way back in the -devel archives.) Right now, once we hit the 64K IDs limit we collapse the whole thing into a single range, which is why so much precision is lost.
That would allow us to track about 1.86M entryIDs in the same space we currently use to track only 64K entryIDs.
It seems we would still need some kind of range representation to handle IDLs with more than 1.86M entries.
And some way to tell that from vectors, possibly using up another bit as a flag in the vector header. Alternatively, don't use IDs where (ID & 31) == 2 (or 31) either, and use that bit in the first vector word as a flag.
Do you have any stats or a feeling for how sparse the average IDL of any size is? Or how much ranges improve over lists of single IDs, for that matter? I suppose it matters somewhat for temporary IDLs in memory as well as in files...
Sparseness in different indices increases over time...
Consider a back-bdb-style onelevel index. If you're adding entries that have a lot of children, then this onelevel index is always going to be very sparse. E.g., with entryIDs assigned in creation order:

    1
      2
        3 4 5
      6
        7 8 9

the onelevel index for entry 1 contains only {2, 6}.
As you add and delete entries, the sparseness increases. It never decreases.
In a substring index, again, the IDLs tend to be quite sparse. In an eq index on objectclass, the IDLs tend to be pretty dense, but it still gets more sparse as adds and deletes occur.