On 10/19/08, Howard Chu <hyc@symas.com> wrote:

Seems like I keep poking at this idea every couple of years.
If we use a <base,vector> pair we can cover a much wider range without losing any precision. On a 32 bit machine, a 32 bit word covers 5 bits' worth of entry IDs, so 27 bits would comprise the base and the next 32 bit word would be a bit vector representing up to 32 entry IDs.
To avoid wasting those 5 unneeded bits in the base, we could use them as a run-length counter and use multiple vectors per base. But it might be better to slide things over and use just a 24 bit base, and use the bottom 8 bits of the word as a bitmask to represent which following vectors are present. E.g.
  ......01 means 1 vector follows, representing base+0-31.
  ......02 means 1 vector follows, representing base+32-63.
  ......05 means 2 vectors follow, base+0-31 then base+64-95.
That allows us to represent up to 256 entryIDs in only 288 bits (instead of the 16384 bits we currently need). That would allow us to track about 1.86M entryIDs in the same space we currently use to track only 64K entryIDs (a current 64K-entry IDL at 32 bits per ID occupies about 2M bits; that many bits, in 288-bit groups of 256 IDs each, covers roughly 1.86M IDs).
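For concreteness, here is a rough sketch of what a membership test might look like with this layout; the function name and exact field packing are just illustrative, not actual slapd code:

    #include <stdint.h>

    /* One group: a header word holding a 24 bit base (= ID >> 8) in the
     * top bits and an 8 bit presence mask in the bottom bits, followed
     * by one 32 bit bitmap vector per set mask bit, in ascending order.
     */
    static int
    idl_group_contains( const uint32_t *group, uint32_t id )
    {
        uint32_t base = group[0] >> 8;      /* which 256-ID block */
        uint32_t mask = group[0] & 0xFF;    /* which vectors follow */
        uint32_t slot, i, n;

        if (( id >> 8 ) != base )
            return 0;                       /* ID outside this group */

        slot = ( id >> 5 ) & 7;             /* which of 8 possible vectors */
        if ( !( mask & ( 1u << slot )))
            return 0;                       /* vector omitted: bit is 0 */

        /* Vectors are stored compactly: count the mask bits below
         * slot to find this vector's position after the header. */
        for ( i = n = 0; i < slot; i++ )
            if ( mask & ( 1u << i ))
                n++;

        return (int)(( group[1 + n] >> ( id & 31 )) & 1 );
    }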
It seems we would still need some kind of range representation to handle IDLs with more than 1.86M entries.
Is it worth the effort for 32 bit when most production servers are 64? Or am I missing the point?
Gavin.
Gavin Henry wrote:
> Is it worth the effort for 32 bit when most production servers are 64? Or am I missing the point?
The question is the same regardless - that is, whether/how to efficiently represent IDLs without losing too much precision and without adding too much complexity. The difference between 32 and 64 bit here is a trivial detail.
Currently we lose precision whenever any IDL goes above 65535 entries and we never get it back even if the IDL later shrinks to less than 65535 entries. We ought to have a scheme that can represent the entire range of valid entryIDs with more granularity but without using much more memory.
Currently on a 32 bit machine that means we can have 2^32 bytes of memory in the machine, but it takes 4 x 2^32 bytes to losslessly represent every valid value, so we clearly have an efficiency problem. Likewise on a 64 bit machine there could be 2^64 bytes of memory, but it would take 8 x 2^64 bytes for our current representation. Using a straight bitvector with no particular tricks would bring that down to 2^32 bits (2^29 bytes) or 2^64 bits (2^61 bytes), which is still impractical.
Realistically, at about 2Kbytes per entry, a 32 bit machine could directly address at most ~2^20 entries (~1M) so an IDL that maxes out at 1.86M before losing precision may be perfectly acceptable.
For 64 bit machines it seems 50 bits (1 petabyte) is the largest physical address space currently available. That puts a practical upper bound, assuming 4Kbytes per entry, at around 2^37 entries (~128 billion) directly addressable. It would take 2^34 bytes to handle that directly, 16GBytes, which again is still impractical. So again, we need some kind of range representation in addition to the actual bit vectors.
Howard Chu wrote:
> Realistically, at about 2Kbytes per entry, a 32 bit machine could directly address at most ~2^20 entries (~1M) so an IDL that maxes out at 1.86M before losing precision may be perfectly acceptable.
> For 64 bit machines it seems 50 bits (1 petabyte) is the largest physical address space currently available. That puts a practical upper bound, assuming 4Kbytes per entry, at around 2^37 entries (~128 billion) directly addressable. It would take 2^34 bytes to handle that directly, 16GBytes, which again is still impractical. So again, we need some kind of range representation in addition to the actual bit vectors.
I think using <base,vector> with fixed-length vectors will be the way forward. When an IDL reaches the maximum length and a new <base,vector> needs to be added, the entire IDL will be scaled by 2 (each bit then representing two consecutive IDs, at half the precision). On most 32 bit installs this will probably never happen; on 64 bit installs it will happen infrequently.
It may be possible/desirable to simply scale one piece of the IDL at a time. Anything that can coalesce two vectors into one will be sufficient to allow the next insertion to proceed. The scale factor can be stored in the unused bits of the <base>...
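As a rough sketch (illustrative only, not actual code), coalescing two adjacent 256 bit vectors into one at twice the scale might look like this; each bit of the result stands for two consecutive IDs:

    #include <stdint.h>

    #define IDL_VWORDS 8        /* 8 x 32 bits = one 256 bit vector */

    /* Merge two vectors covering adjacent 256-ID blocks into a single
     * vector at scale 2: output bit i is set if input bit 2i or 2i+1
     * was set.
     */
    static void
    idl_coalesce( const uint32_t lo[IDL_VWORDS],
        const uint32_t hi[IDL_VWORDS], uint32_t out[IDL_VWORDS] )
    {
        int w, b;

        for ( w = 0; w < IDL_VWORDS; w++ )
            out[w] = 0;

        for ( w = 0; w < 2 * IDL_VWORDS; w++ ) {
            uint32_t v = ( w < IDL_VWORDS ) ? lo[w] : hi[w - IDL_VWORDS];
            for ( b = 0; b < 32; b++ )
                if ( v & ( 1u << b )) {
                    int bit = ( w * 32 + b ) / 2;   /* halve precision */
                    out[bit / 32] |= 1u << ( bit % 32 );
                }
        }
    }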
Howard Chu writes:
> To avoid wasting those 5 unneeded bits in the base, we could use them as a run-length counter and use multiple vectors per base. But it might be better to slide things over and use just a 24 bit base, and use the bottom 8 bits of the word as a bitmask to represent which following vectors are present. E.g.
With variable-length records you can't do a binary search for an entry ID. I don't know how significant that is though. And you can get around it, at some expense:
> ......01 means 1 vector follows, representing base+0-31.
> ......02 means 1 vector follows, representing base+32-63.
That can be normalized so bottom bit == 1: (base+32, 01). If we also don't use entry IDs with (ID & 31) == 0, we can look at the bottom bit of a word in the IDL to tell if it is a vector header or a vector word.
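A sketch of that normalization, writing the header as an explicit (base, mask) pair for clarity (illustrative only, not actual code):

    #include <stdint.h>

    struct idl_hdr {
        uint32_t base;  /* first entry ID covered; multiple of 32 */
        uint8_t  mask;  /* which of up to 8 following vectors exist */
    };

    /* Shift the base forward over absent leading vectors until the
     * mask's bottom bit is set, e.g. (base, 02) becomes (base+32, 01).
     * If entry IDs with (ID & 31) == 0 are never assigned, the bottom
     * bit of every bitmap word stays 0, so any word in the IDL can be
     * classified as header or vector without positional context.
     */
    static void
    idl_normalize( struct idl_hdr *h )
    {
        while ( h->mask && !( h->mask & 1 )) {
            h->base += 32;
            h->mask >>= 1;
        }
    }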
> That allows us to represent up to 256 entryIDs in only 288 bits (instead of the 16384 bits we currently need).
Though in sparse IDLs, worst case will be 2 words per entry ID.
OTOH if we have exact ranges, the best case is 2 words for any number of entries. :-) I don't know how they are stored now, though; you mentioned loss of precision, so I assume approximate ranges are involved, but a glance at the code doesn't reveal where.
> That would allow us to track about 1.86M entryIDs in the same space we currently use to track only 64K entryIDs.
> It seems we would still need some kind of range representation to handle IDLs with more than 1.86M entries.
And some way to tell that from vectors, possibly using up another bit as a flag in the vector header. Alternatively, don't use IDs where (ID & 31) == 2 (or 31) either, and use that bit in the first vector words as a flag.
Do you have any stats or feeling about how sparse the average IDL of any size is? Or how much ranges improve over lists of single IDs, for that matter? I suppose it matters somewhat in temporary IDLs in memory as well as in files...
Hallvard B Furuseth wrote:
> Howard Chu writes:
>> To avoid wasting those 5 unneeded bits in the base, we could use them as a run-length counter and use multiple vectors per base. But it might be better to slide things over and use just a 24 bit base, and use the bottom 8 bits of the word as a bitmask to represent which following vectors are present. E.g.
> With variable-length records you can't do a binary search for an entry ID. I don't know how significant that is though. And you can get around it, at some expense:
It may be best to just use fixed size (maximal) vectors instead then. We can probably afford 256 bits of slop here and there. Then, absent any special markers, we only need a 24 bit base + 256 bit vector.
>> That allows us to represent up to 256 entryIDs in only 288 bits (instead of the 16384 bits we currently need).
So only 280 bits instead of 288...
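A sketch of the fixed-size layout and the binary search it preserves (illustrative only; the struct below rounds the 24 bit base up to a full word, so it occupies 288 bits rather than the packed 280):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct idl_group {
        uint32_t base;  /* entry ID block number (ID >> 8) */
        uint32_t vec[8];    /* 256 bit bitmap for base*256 .. base*256+255 */
    } idl_group;

    /* Binary search an array of groups sorted by base. */
    static const idl_group *
    idl_find( const idl_group *g, size_t n, uint32_t id )
    {
        size_t lo = 0, hi = n;
        uint32_t blk = id >> 8;

        while ( lo < hi ) {
            size_t mid = lo + ( hi - lo ) / 2;
            if ( g[mid].base < blk )
                lo = mid + 1;
            else if ( g[mid].base > blk )
                hi = mid;
            else
                return &g[mid];
        }
        return NULL;    /* id's block not present */
    }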
> Though in sparse IDLs, worst case will be 2 words per entry ID.
> OTOH if we have exact ranges, the best case is 2 words for any number of entries. :-) I don't know how they are stored now, though; you mentioned loss of precision, so I assume approximate ranges are involved, but a glance at the code doesn't reveal where.
We don't use lists of ranges; that was one of my original proposals but Jong posted some arguments against that. (Way back in the -devel archives.) Right now, once we hit the 64K IDs limit we collapse the whole thing into a single range, which is why so much precision is lost.
>> That would allow us to track about 1.86M entryIDs in the same space we currently use to track only 64K entryIDs.
>> It seems we would still need some kind of range representation to handle IDLs with more than 1.86M entries.
> And some way to tell that from vectors, possibly using up another bit as a flag in the vector header. Alternatively, don't use IDs where (ID & 31) == 2 (or 31) either, and use that bit in the first vector words as a flag.
> Do you have any stats or feeling about how sparse the average IDL of any size is? Or how much ranges improve over lists of single IDs, for that matter? I suppose it matters somewhat in temporary IDLs in memory as well as in files...
Sparseness in different indices increases over time...
Consider a back-bdb-style onelevel index. If you're adding entries that have a lot of children, then this onelevel index is always going to be very sparse:

        1
      /   \
     2     6
    /|\   /|\
   3 4 5 7 8 9

Here the onelevel IDL for entry 1 holds only {2, 6}, and the gap between those IDs widens as each subtree grows.
As you add and delete entries, the sparseness increases. It never decreases.
In a substring index, again, the IDLs tend to be quite sparse. In an eq index on objectclass, the IDLs tend to be pretty dense, but they still get sparser as adds and deletes occur.