Howard Chu writes:
To avoid wasting those 5 unneeded bits in the base, we could use them as a run-length counter and use multiple vectors per base. But it might be better to slide things over and use just a 24-bit base, and use the bottom 8 bits of the word as a bitmask of which following vectors are present. E.g.

  ......01 means 1 vector follows, representing base+ 0-31.
  ......02 means 1 vector follows, representing base+32-63.

That can be normalized so that the bottom bit == 1: (base, 02) becomes (base+32, 01). If we also don't use entry IDs with (ID & 31) == 0, we can look at the bottom bit of a word in the IDL to tell whether it is a vector header or a vector word.
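To pin the idea down, here is a rough C sketch of that layout. The names, and the choice of storing the base in the top 24 bits in units of 32 IDs, are just my guesses for illustration:

  #include <stdint.h>

  /* Header word: 24-bit base (32-aligned, stored in units of 32 IDs)
   * in the top bits, 8-bit vector mask in the bottom bits.  Mask bit k
   * set means a 32-bit bitmap vector for IDs base+32*k .. base+32*k+31
   * follows the header.
   */
  #define VH_MAKE(base, mask) ((((base) >> 5) << 8) | (mask))
  #define VH_BASE(w)          (((uint32_t)(w) >> 8) << 5)
  #define VH_MASK(w)          ((uint32_t)(w) & 0xffu)

  /* Normalize so mask bit 0 is set: (base, 02) becomes (base+32, 01).
   * mask must be nonzero.  Headers then always have bottom bit 1, and
   * if IDs with (ID & 31) == 0 are never used, vector words always
   * have bottom bit 0, so the two kinds of word can be told apart.
   */
  static uint32_t
  vh_normalize(uint32_t base, uint32_t mask)
  {
      while (!(mask & 1)) {
          base += 32;
          mask >>= 1;
      }
      return VH_MAKE(base, mask);
  }

  /* Membership test; idl points at a header word, vectors follow it. */
  static int
  vh_member(const uint32_t *idl, uint32_t id)
  {
      uint32_t off = id - VH_BASE(idl[0]);   /* wraps if id < base */
      unsigned slot, i, nvec = 0;

      if (off >= 256)
          return 0;                          /* outside this record */
      slot = off >> 5;
      if (!(VH_MASK(idl[0]) & (1u << slot)))
          return 0;                          /* vector not present */
      for (i = 0; i < slot; i++)             /* find vector's position */
          if (VH_MASK(idl[0]) & (1u << i))
              nvec++;
      return (idl[1 + nvec] >> (off & 31)) & 1;
  }

With that encoding, vh_normalize(0, 0x02) == VH_MAKE(32, 0x01), matching the example above.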
That allows us to represent up to 256 entryIDs in only 288 bits (one 32-bit header word plus eight 32-bit vectors), instead of the 16384 bits we currently need. Though in sparse IDLs, the worst case will be 2 words per entry ID. OTOH if we have exact ranges, the best case is 2 words for any number of entries :-) I don't know how ranges are stored now, though; you mentioned loss of precision, so I assume approximate ranges are involved, but a glance at the code doesn't reveal where.
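Spelling out my arithmetic (32-bit words for the new format; the 16384 figure assumes the current format spends 64 bits per ID):

  dense:   1 header + 8 vectors = 9 * 32 bits  =   288 bits per 256 IDs
  current:            256 IDs * 64 bits        = 16384 bits per 256 IDs
  sparse:  1 header + 1 vector  = 2 * 32 bits  =    64 bits per lone ID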
That would allow us to track about 1.86M entryIDs in the same space we currently use to track only 64K entryIDs. It seems we would still need some kind of range representation to handle IDLs with more than 1.86M entries, and some way to tell a range from vectors, possibly using up another bit as a flag in the vector header. Alternatively, don't use IDs where (ID & 31) == 2 (or 31) either, and use that bit in the first vector word as a flag.

With variable-length records you can't do a binary search for an entry ID. I don't know how significant that is, though. And you can get around it, at some expense:
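Perhaps something like this (an untested sketch, reusing the encoding from the previous snippet): binary-search the flat word array, and whenever a probe lands inside a record, back up to the owning header, which the bottom-bit trick identifies. The backward scan is the expense, but it is bounded, since a record is at most 1 + 8 = 9 words:

  #include <stddef.h>
  #include <stdint.h>

  #define VH_BASE(w)  (((uint32_t)(w) >> 8) << 5)
  #define VH_MASK(w)  ((uint32_t)(w) & 0xffu)
  #define IS_HDR(w)   ((w) & 1u)    /* headers odd, vector words even */

  static unsigned
  popcount8(uint32_t m)
  {
      unsigned n = 0;
      for (; m; m >>= 1)
          n += m & 1;
      return n;
  }

  /* Find id in a sorted concatenation of header+vector records.
   * (A range record would additionally need its flag bit checked
   * wherever IS_HDR() is consulted.)
   */
  static int
  idl_member(const uint32_t *idl, size_t nwords, uint32_t id)
  {
      size_t lo = 0, hi = nwords, best = nwords;

      while (lo < hi) {
          size_t h = lo + (hi - lo) / 2;

          while (!IS_HDR(idl[h]))   /* back up to the owning header; */
              h--;                  /* idl[0] is always a header */

          if (VH_BASE(idl[h]) <= id) {
              best = h;             /* rightmost header so far with */
                                    /* base <= id; skip past its record */
              lo = h + 1 + popcount8(VH_MASK(idl[h]));
          } else {
              hi = h;
          }
      }

      if (best == nwords)
          return 0;                 /* id below the first base */
      {
          uint32_t off = id - VH_BASE(idl[best]);
          unsigned slot = off >> 5, i, nvec = 0;

          if (off >= 256 || !(VH_MASK(idl[best]) & (1u << slot)))
              return 0;
          for (i = 0; i < slot; i++)
              if (VH_MASK(idl[best]) & (1u << i))
                  nvec++;
          return (idl[best + 1 + nvec] >> (off & 31)) & 1;
      }
  }

Whether the extra reads matter probably depends on how clustered real IDLs are, which loops back to the stats question below.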
Do you have any stats or feeling about how sparse the average IDL of any size is? Or how much ranges improve over lists of single IDs, for that matter? I suppose it matters somewhat for temporary IDLs in memory as well as in files...