
I wrote my own implementation of the HyperLogLog algorithm. It works well, but sometimes I have to fetch a lot of HLL structures (around 10k-100k) and merge them.

I store each of them as a bit string, so first I have to convert each bit string to buckets. Since there are a LOT of HLLs, this takes more time than I would like.

Currently around 80% of the runtime is spent in this line of code, which is called once for each HLL:

my @buckets = map { oct '0b'.$_ } unpack('(a5)1024', $bitstring);

Is there any way to do it faster?

Leaving the definition of HyperLogLog aside, the task can be stated like this: given that $bitstring consists of 1024 5-bit counters (so each counter can hold a value from 0 to 31), convert it to an array of 1024 integers.
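To make the task concrete, here is a minimal sketch of the same conversion on a toy input of three counters. The input string is made up for illustration, and the repeat count is derived from the string length rather than hard-coded to 1024.

```perl
use strict;
use warnings;

# Three 5-bit counters written as ASCII '0'/'1' characters
# (toy data, not a real HLL): 00001, 10000, 11111.
my $bitstring = '00001' . '10000' . '11111';

# Same technique as the line in question: split into 5-character
# chunks, then parse each chunk as a binary literal.
my $count   = length($bitstring) / 5;
my @buckets = map { oct "0b$_" } unpack "(a5)$count", $bitstring;

print "@buckets\n";   # prints "1 16 31"
```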

Ria
skaurus
  • would you provide a few examples of $bitstring? also how long does it take to run and what is acceptable? – michael501 May 15 '14 at 19:21
  • @michael, it's a string, as simple as "101011..." etc. Its length is 5120 symbols. – skaurus May 15 '14 at 20:08
  • Just a quick note that there is a cpan module for this: [`Algorithm::HyperLogLog`](https://metacpan.org/pod/Algorithm::HyperLogLog) – Miller May 16 '14 at 19:59
  • @Miller Thanks, I didn't know about it. Funny that it does export and import via files, the same way as Bloom::Faster. Because of this I had to write my own Bloom filter too. Just what were they thinking? Why not give me a string I could store anywhere I want? My blooms and HLLs both use Redis instead of files, and this is super convenient given Redis bit operations and Lua scripting (I know that Redis has built-in HLLs, but they are too big for me). – skaurus May 16 '14 at 20:20

1 Answer


The `a` format denotes arbitrary, null-padded binary data. Here you treat that data as ASCII text, but it can only contain the characters 1 and 0! This is inefficient, as `a5` ends up using five bytes per counter. The easiest and most efficient solution would be to store an 8-bit number for each counter. Decoding then becomes: my @buckets = unpack 'C1024', $bitstring;
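As a rough sketch of how the one-byte-per-counter layout could be used end to end, the snippet below packs two small bucket arrays, unpacks them again, and merges them. The helper names (`hll_to_bytes`, `hll_from_bytes`, `merge_buckets`) are hypothetical and not part of any module. The merge assumes the usual HyperLogLog rule of taking the per-bucket maximum, which may differ from what the asker's implementation does.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helpers: one unsigned byte per counter.
sub hll_to_bytes   { return pack 'C*', @_ }
sub hll_from_bytes { return unpack 'C*', $_[0] }

# Assumed merge rule: per-bucket maximum of two equally sized arrays.
sub merge_buckets {
    my ($x, $y) = @_;
    return [ map { max($x->[$_], $y->[$_]) } 0 .. $#$x ];
}

# Toy data with 4 buckets instead of 1024.
my @first  = (0, 3, 31, 7);
my @second = (1, 2, 30, 9);

my $merged = merge_buckets(
    [ hll_from_bytes(hll_to_bytes(@first)) ],
    [ hll_from_bytes(hll_to_bytes(@second)) ],
);

print "@$merged\n";   # prints "1 3 31 9"
```

With real data the template would be 'C1024', and the stored string would be exactly 1024 bytes long.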

If you only want to store five bits per counter, you end up saving very little memory for a great deal of hassle. You'd have to use something insane like this for a round-trip conversion:

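# Zero-pad every counter to 5 binary digits, then pack all 5120 bits into 640 bytes.
my $bitstring = pack 'b*', join '', map { sprintf '%05b', $_ } @buckets;
# Expand back to a 5120-character '0'/'1' string, split into 5-digit chunks, parse each.
@buckets = map { oct "0b$_" } unpack '(a5)1024', unpack('b*', $bitstring);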
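To compare the three representations discussed here (ASCII digits, truly packed 5-bit counters, and one byte per counter), a sketch along these lines could be used. It relies on the core Benchmark module. The bucket data is random and made up for the test, and the labels `ascii`, `packed`, and `bytes` are just illustrative names. Timings will vary by machine and Perl version, so no numbers are claimed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Random test data: 1024 counters in the range 0..31.
my @buckets = map { int rand 32 } 1 .. 1024;

my $ascii  = join '', map { sprintf '%05b', $_ } @buckets;  # 5120 bytes
my $packed = pack 'b*', $ascii;                             # 640 bytes
my $bytes  = pack 'C*', @buckets;                           # 1024 bytes

my %decode = (
    ascii  => sub { map { oct "0b$_" } unpack '(a5)1024', $ascii },
    packed => sub { map { oct "0b$_" } unpack '(a5)1024', unpack('b*', $packed) },
    bytes  => sub { unpack 'C1024', $bytes },
);

# Sanity check: every decoder must reproduce the original counters.
for my $name (sort keys %decode) {
    my @got = $decode{$name}->();
    die "$name decoder disagrees" unless "@got" eq "@buckets";
}

# Run each decoder for at least one CPU second and print a comparison table.
cmpthese(-1, {
    map { my $n = $_; ($n => sub { my @b = $decode{$n}->() }) } keys %decode
});
```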
amon
  • "The easiest and most efficient solution would be to store a 8-bit number for each counter" - that makes sense. I was going to say that I could have up to 36k HLLs, which isn't too many, so I could afford another 40% of memory; but then I realised that's a mistake, and with the current implementation I could actually have up to hundreds of millions of them :) I should return to the drawing board... %) – skaurus May 15 '14 at 20:20
  • So, one more cent before drawing board - "unpack 'C1024'" is really at least 4 times faster. Thanks! – skaurus May 16 '14 at 06:03