The current SimHashIterator works by finding the most significant active bit, and then calculating the number of SimHash buckets that will exist from that bit to the end of the filter. This means we are iterating (most_significant_bit / SIM_BUCKET_SIZE) + SIM_BUCKETS buckets.
The problem here is that in the higher order bits the sparsity is very high, meaning that there will potentially be many buckets where the values we're mixing into our SimHashes will just be zeros, and given any value XOR 0 is just that value this is wasted computation.
We could move to use the BitChunks iterators which batch up our chunks into a u64 and an index of that block of bits, skipping ranges full of zeros. This will introduce issues where if our SimHash bucket straddles multiple BitChunks, which can be solved by peeking the next chunk, making sure that the index is contiguous with our current chunk.
Right now SIM_BUCKET_SIZE is always 6 bits but if we allowed fully arbitrary SimHash bucket sizes this could get slightly gnarly because we'd need to be able to peek an arbitrary number of BitChunks ahead of the current position.
The following is a line chart plot of the proportion of empty bit_ranges mixed into the SimHashes using pseudorandomly generated geofilters with 0..100_000 items added to them.
The ratio reduces very sharply after a couple of inserts but remains relatively high since newly added items have some chance of landing in a very significant bucket which introduces a load of zero gaps before the next one bit.
The current
SimHashIteratorworks by finding the most significant active bit, and then calculating the number ofSimHashbuckets that will exist from that bit to the end of the filter. This means we are iterating(most_significant_bit / SIM_BUCKET_SIZE) + SIM_BUCKETSbuckets.The problem here is that in the higher order bits the sparsity is very high, meaning that there will potentially be many buckets where the values we're mixing into our
SimHashes will just be zeros, and given any value XOR0is just that value this is wasted computation.We could move to use the
BitChunksiterators which batch up our chunks into au64and an index of that block of bits, skipping ranges full of zeros. This will introduce issues where if ourSimHashbucket straddles multipleBitChunks, which can be solved by peeking the next chunk, making sure that the index is contiguous with our current chunk.Right now
SIM_BUCKET_SIZEis always6bits but if we allowed fully arbitrarySimHashbucket sizes this could get slightly gnarly because we'd need to be able to peek an arbitrary number ofBitChunks ahead of the current position.The following is a line chart plot of the proportion of empty
bit_ranges mixed into the SimHashes using pseudorandomly generated geofilters with0..100_000items added to them.The ratio reduces very sharply after a couple of inserts but remains relatively high since newly added items have some chance of landing in a very significant bucket which introduces a load of zero gaps before the next one bit.