Questions tagged [hyperloglog]

HyperLogLog is an approximate technique for estimating the number of distinct entries in a set. It is implemented in Algebird, a Scala library for abstract algebra, and can be used in Summingbird to create MapReduce programs that estimate the cardinality of large datasets in streaming (online) or batch (offline) mode. The Redis data structure store also has a HyperLogLog implementation.
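
As a rough illustration of the technique described above, here is a minimal Python sketch of a HyperLogLog counter. It is not the Algebird or Redis implementation: the class name, the SHA-1 based 64-bit hash, and the default precision `p=12` are all assumptions made for this example, and the large-range corrections used by production implementations are left out.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with m = 2**p registers (illustrative, not tuned)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias constant, valid for m >= 128
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            est = self.m * math.log(self.m / zeros)
        return est
```

Adding the same item twice leaves the registers unchanged, which is why the structure counts distinct entries rather than total insertions. With `p=12` the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%.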

89 questions
1
vote
2 answers

What Algorithm is used by java.util.HashSet and java.util.TreeSet to store unique values in its structure?

I have come across multiple algorithms, such as the Flajolet-Martin algorithm and HyperLogLog, for finding the unique elements in a list, and suddenly became curious about how Java calculates it. And what is the time complexity in each of these…
Phenomenal One
  • 2,501
  • 4
  • 19
  • 29
1
vote
1 answer

When should Redis HyperLogLog be avoided and why?

I have some basic ideas of how Redis HyperLogLog works and when to use it. Before using it I did a test: I pfadded some consecutive numbers to an HLL entry (to mimic user ids), and Redis soon gave a false positive result. To be exact, if you pfadd…
adamsmith
  • 5,759
  • 4
  • 27
  • 39
1
vote
1 answer

Merge uniq counters, probabilistic data structures

There are two sets 1 2 3 and 3 4 with 3 and 2 unique items. Now let's calculate unique items in merged set. If we just sum up the counters 3 + 2 = 5 it will be wrong (it should be uniq(1 2 3 3 4) = 4). Is there a way to do it using only the…
Alex Craft
  • 13,598
  • 11
  • 69
  • 133
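
For the merge question above, one commonly used approach is to keep HyperLogLog registers instead of plain counters, since two sketches built with the same hash function and precision can be unioned by taking the register-wise maximum. The following is a minimal sketch under those assumptions; the helper names `sketch`, `merge`, and `estimate`, the SHA-1 hash, and the precision `P = 10` are hypothetical choices for illustration.

```python
import hashlib
import math

P = 10          # precision: 2**10 = 1024 registers (arbitrary for this sketch)
M = 1 << P


def sketch(items):
    """Build HyperLogLog registers for an iterable of items."""
    regs = [0] * M
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - P)                      # top P bits select a register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
        regs[idx] = max(regs[idx], (64 - P) - rest.bit_length() + 1)
    return regs


def merge(a, b):
    """Union of two sketches: keep the larger value in each register."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(regs):
    zeros = regs.count(0)
    raw = (0.7213 / (1 + 1.079 / M)) * M * M / sum(2.0 ** -r for r in regs)
    # Linear counting is more accurate while many registers are still empty.
    return M * math.log(M / zeros) if raw <= 2.5 * M and zeros else raw


a = sketch([1, 2, 3])
b = sketch([3, 4])
merged = merge(a, b)
```

Because each register only records a maximum, `merged` is identical to the registers built directly from the combined stream `1 2 3 3 4`, so the shared item 3 is not counted twice and `estimate(merged)` is smaller than `estimate(a) + estimate(b)`.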
1
vote
2 answers

How the LogLog algorithm works with a single hash function

I have found tens of explanations of the basic idea of LogLog algorithms, but they all lack details about how splitting the hash function result works. I mean, using a single hash function is not precise, while using many functions is too expensive. How…
VB_
  • 45,112
  • 42
  • 145
  • 293
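
The hash splitting asked about above is usually done by stochastic averaging: the first `p` bits of a single hash value choose a bucket, and the remaining bits supply the rank (the position of the first 1-bit). The function below is a small sketch of that split on an already computed hash; the name `split_hash` and the 16-bit width in the example are assumptions chosen to keep the numbers readable.

```python
def split_hash(h, p, bits):
    """Split a `bits`-wide hash into (bucket index, rank).

    The top p bits pick one of 2**p buckets; the rank is the 1-based
    position of the leftmost 1-bit in the remaining bits - p bits.
    """
    bucket = h >> (bits - p)
    rest = h & ((1 << (bits - p)) - 1)
    rank = (bits - p) - rest.bit_length() + 1
    return bucket, rank


# 16-bit example: 1011 | 000101100000
# bucket = 0b1011 = 11; the remainder has three leading zeros, so rank = 4
bucket, rank = split_hash(0b1011000101100000, p=4, bits=16)
```

Each bucket then behaves like an independent estimator fed by roughly 1/2**p of the stream, which gives the variance reduction of many hash functions at the cost of computing only one.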
1
vote
2 answers

Determine percentage of unused keys in large redis DB

I have a Redis database with many millions of keys in it. Over time, the keys that I have written to and read from have changed, and so there are many keys that I am simply not using any more. Most don't have any kind of TTL either. I want to get a…
alec
  • 141
  • 2
  • 11
1
vote
3 answers

Postgresql-hll (or another Hyperloglog data type/structure) for Redshift

Need to be able to report on Unique Visitors, but would like to avoid pre-computing every possible permutation of keys and creating multiple tables. As a simplistic example, let's say I need to report Monthly Uniques in a table that has the…
Sologoub
  • 5,312
  • 6
  • 37
  • 65
1
vote
1 answer

Cardinality approximation for logical set operations – (The "HyperLogLog" for AND/OR/XOR)

We are currently facing an interesting problem. We would like to estimate the cardinality of a set without the need to store every single item (typically bitmaps/bitsets are a nice approach). A very nice algorithm is the so-called HyperLogLog…
Fritz
  • 872
  • 1
  • 8
  • 17
1
vote
1 answer

How to get a family of independent universal hash functions?

I am trying to implement the HyperLogLog counting algorithm using stochastic averaging. To do that, I need many independent universal hash functions to hash items in different substreams. I found that there are only a few hash functions available in…
Louis Kuang
  • 727
  • 1
  • 14
  • 30
1
vote
1 answer

How does one store unique "Likes" or "Views" or sets at scale?

I'd like to get some insight into how various companies solve counting/incrementing the number of "likes"/"views"/"retweets" or something similar at scale. At userbases past 50 million monthly active users, I've seen both Redis and Cassandra used to…
nflacco
  • 4,972
  • 8
  • 45
  • 78
1
vote
1 answer

How do you test an implementation of HyperLogLog?

There are so many HyperLogLog implementations out there, but how do you verify or test a HyperLogLog implementation? How do you check its "accuracy" and its "error" bound behavior? Just throwing some static test cases at it looks very ineffective. More concrete,…
ETOMG
  • 11
  • 3
1
vote
1 answer

How to migrate a HyperLogLog key to Azure Redis

I am trying to migrate a Redis HyperLogLog key from one server to the Azure Redis service using the MIGRATE command, but as far as I know MIGRATE doesn't support moving a key to a Redis server that requires authentication. How can I migrate hyperloglog…
Kobynet
  • 983
  • 11
  • 23
1
vote
1 answer

Filtering huge quantities of data with combinations of logical expressions

I have huge quantities of data represented as (for example) - User ID | Gender | Location | Type of User There may be more columns depending on the use case. The location is denoted by a pincode. I recently read about HyperLogLog and the Redis…
frugalcoder
  • 959
  • 2
  • 11
  • 23
0
votes
0 answers

Implementing HLL in Python to estimate the cardinality

I'm trying to implement the HLL algorithm in Python. I'm using data folders with the format of 13 bytes "\x00\x01\x02\x03\x05\x06\x07\x08\x09\x0a\x0b\x0c", described as follows: srcIP = "\x00\x01\x02\x03" srcPort = "\x04\x05" dstIP =…
Ella
  • 19
0
votes
1 answer

Does a single HyperLogLog have the same accuracy as merging several?

If I create a HyperLogLog per day to count unique visitors, and then on the 1st of January I merge the last 365 of them, will I get the same value as if I had kept a single HyperLogLog for the whole 365 days? I guess not. But how different would those values…
vtscop
  • 31
  • 11
0
votes
1 answer

If HyperLogLog in Redis does not store the actual members but only count, how does PFMERGE work?

Does HyperLogLog store the actual members or only the count of members it is storing? If it is not storing the actual members, how does PFMERGE know which element to merge as count of 1 even when they are repeated across multiple HyperLogLog PFADD…
Ankit Sahay
  • 1,710
  • 8
  • 14