Questions tagged [hyperloglog]

HyperLogLog is an approximate technique for computing the number of distinct entries in a set.

It is implemented in Algebird, a Scala library for abstract algebra, and can be used in Summingbird to create MapReduce programs that estimate the cardinalities of large datasets in streaming (online) or batch (offline) mode. The Redis data structure store also has a HyperLogLog implementation.

89 questions
0
votes
1 answer

Redis HyperLogLog - Too many errors

The scenario is really simple. I'm adding 50 elements (different each time) to an HLL. Usually on the third batch, I get a wrong PFCOUNT (151 instead of 150). I know that HLL has a low error rate, but is it so easy to get a false positive? can…
lordav
  • 105
  • 1
  • 2
  • 10
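As a rough sanity check on the numbers in this excerpt, the sketch below compares the observed deviation (151 vs. 150) with the textbook HyperLogLog standard error of about 1.04 / sqrt(m). It assumes m = 16384 registers, which is the register count Redis documents for its implementation (standard error 0.81%). The asymptotic formula is only an approximation at small cardinalities, so treat this as an illustration rather than a description of Redis internals.

```python
import math


def hll_standard_error(num_registers: int) -> float:
    """Asymptotic relative standard error of a HyperLogLog with m registers."""
    return 1.04 / math.sqrt(num_registers)


# Assumption: 16384 (2**14) registers, as documented for Redis HyperLogLog.
REDIS_REGISTERS = 16384

standard_error = hll_standard_error(REDIS_REGISTERS)

# Figures taken from the question excerpt: PFCOUNT said 151, true count was 150.
observed_error = abs(151 - 150) / 150

print(f"standard error: {standard_error:.4%}")
print(f"observed error: {observed_error:.4%}")
print("within one standard error:", observed_error <= standard_error)
```

Under these assumptions, an estimate that is off by one at 150 elements falls inside the expected error band, so it is not a sign of a malfunction.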
0
votes
1 answer

Do we have a way to check if an element already exists in HyperLogLog in Java?

I have a use case where I need to check whether an element already exists in the HyperLogLog and, if not, make an HBase call. Is there any method in Java to check whether an element already exists in a HyperLogLog?
0
votes
1 answer

Druid Default Distinct Approximation Algorithm

Is there a way to replace the default HLL approximation algorithm with ThetaSketch in Druid, so that count-distinct queries use ThetaSketch instead of HLL by default?
user2693313
  • 341
  • 3
  • 5
  • 13
0
votes
1 answer

How to expire a HyperLogLog in Redis?

HyperLogLogs take up 12KB of space. I don't see anything in the docs about when that storage is freed up. My current plan is to call EXPIRE every time I call PFADD, but I can't find much discussion about expiring HLLs, so I'm wondering if I'm doing…
mgalgs
  • 15,671
  • 11
  • 61
  • 74
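A minimal sketch of the plan described in this excerpt (call EXPIRE every time PFADD is called), written against a redis-py style client. The helper name `pfadd_with_ttl` and the key name in the usage comment are hypothetical; `pipeline()`, `pfadd()`, `expire()` and `execute()` are the redis-py calls it relies on. It assumes a sliding TTL is acceptable, since each call resets the expiry.

```python
def pfadd_with_ttl(client, key, ttl_seconds, *elements):
    """Add elements to a HyperLogLog key and refresh its TTL in one round trip.

    `client` is assumed to be a redis-py client (e.g. redis.Redis()).
    Returns the PFADD reply: 1 if the sketch changed, 0 otherwise.
    """
    pipe = client.pipeline()           # queues commands and sends them together
    pipe.pfadd(key, *elements)         # PFADD key element [element ...]
    pipe.expire(key, ttl_seconds)      # EXPIRE key seconds (resets the TTL)
    pfadd_reply, _expire_reply = pipe.execute()
    return pfadd_reply


# Hypothetical usage, assuming a reachable Redis server:
#   import redis
#   r = redis.Redis()
#   pfadd_with_ttl(r, "visitors:2024-01-01", 86400, "user-1", "user-2")
```

Because the TTL is reset on every write, a key that keeps receiving elements never expires; if a fixed window is needed, the TTL would have to be set only when the key is first created.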
0
votes
1 answer

Is usage analysis based on HyperLogLog compliant with GDPR?

Context: we have a telemetry system for our service and would like to track retention, how many users use various features, etc. There are two options to deal with user-identifiable information and be GDPR compliant: Support deleting user information…
ZakiMa
  • 5,637
  • 1
  • 24
  • 48
0
votes
1 answer

Using HLL_COUNT.MERGE outside of SQL

I can use the following query to generate all the HLL sketches of the distinct counts: SELECT category, count(distinct city), HLL_COUNT.INIT(city) FROM `table` GROUP BY category And I get something like this: While I would normally use the…
David542
  • 104,438
  • 178
  • 489
  • 842
0
votes
1 answer

Can't call `ApproximateDistinct.ApproximateDistinctFn` from Apache Beam SQL

I'm trying to use the aggregate function ApproximateDistinct.ApproximateDistinctFn from Apache Beam SQL, but it fails. My SQL: SELECT ApproximateDistinct(user_id) as distinct_count, profile, country_code, FROM PCOLLECTION GROUP BY…
Brachi
  • 637
  • 9
  • 17
0
votes
2 answers

How to save a HyperLogLog field to BigQuery from Apache Beam with the Dataflow runner

I need to save HLL sketches into BigQuery from Apache Beam. I found an extension library for Apache Beam that does it, but I can't find a way to save the sketch itself to BigQuery so that I can use it later with the merge function and other functions…
0
votes
1 answer

Redis HyperLogLog key's cardinality is not increasing

When using a HyperLogLog key in a Redis 4.0.9 master-slave setup, we have a PFCOUNT of 52161862. Now when we add a unique item through PFADD, it returns zero and the count is still 52161862. Any idea why we are not able to add more unique items to this key?
0
votes
1 answer

What is the difference between a probabilistic data structure and a sketch?

According to this StackOverflow question, probabilistic data structures are data structures that give approximate, as opposed to precise, answers. In particular, they have very low time and space complexities and are easily parallelizable, making…
Shuklaswag
  • 1,003
  • 1
  • 10
  • 27
0
votes
2 answers

Presto's support for approx_distinct

I am evaluating distributed query engines for analytical queries (both interactive as well as batch) on large scale data (~100GB). One of the requirements is low latency (<= 1s) for count-distinct queries, where approximate results (with up to 5%…
Ameya
  • 83
  • 1
  • 9
0
votes
1 answer

Explanation on HyperLogLog algorithm

First of all, let me start off by saying that I read this question. As I was strolling through the internet, I came across that algorithm and wondered how it worked. After reading about it I did understand how it counts the views by…
nonerth
  • 549
  • 2
  • 7
  • 19
0
votes
1 answer

Count unique visitors across any time range analytics?

We have a use case where we want to report on the unique visitors in our app across any time range (hour granularity). Example: Suppose at hour 0 we had the following visitors {A, B, C, D}, at hour 1 we have {C, D, E, F}, and at hour 2 we have {E, F, A,…
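One common way to approach this kind of report is to keep one sketch per hour and merge the sketches for whichever range is requested, since HyperLogLog sketches merge by taking the register-wise maximum. The toy class below (`TinyHLL`, a hypothetical name) is a simplified illustration of that idea, not a production implementation; the hour-2 visitor set in the demo is made up, because the original example is truncated.

```python
import hashlib
import math
from functools import reduce


class TinyHLL:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash: the first p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "TinyHLL") -> "TinyHLL":
        # Union of two sketches: register-wise maximum.
        merged = TinyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


# Hypothetical hourly visitor sets (hour 2 is invented for the demo).
hourly_visitors = {0: ["A", "B", "C", "D"], 1: ["C", "D", "E", "F"], 2: ["E", "F", "A", "G"]}

hourly_sketches = {}
for hour, visitors in hourly_visitors.items():
    sketch = TinyHLL()
    for visitor in visitors:
        sketch.add(visitor)
    hourly_sketches[hour] = sketch


def unique_visitors(first_hour: int, last_hour: int) -> float:
    """Estimate distinct visitors over an inclusive hour range by merging sketches."""
    selected = [hourly_sketches[h] for h in range(first_hour, last_hour + 1)]
    return reduce(lambda a, b: a.merge(b), selected).count()


print(unique_visitors(0, 1))  # estimate for hours 0..1
print(unique_visitors(0, 2))  # estimate for hours 0..2
```

Summing per-hour counts would double count returning visitors, whereas merging the stored sketches gives the same registers as a single sketch built over the whole range.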
0
votes
3 answers

Why is 1 added to the leading zero count in the HyperLogLog algorithm?

If there are k leading zeros in the bit pattern of the hash, why is the estimated size considered to be 2^(k+1)? Shouldn't it be 2^k? The probability of having k leading zeros should be 1/2^k, and hence the size should be 2^k. In my code I always get…
Golak Sarangi
  • 809
  • 7
  • 22
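One way to look at the question above is to separate "at least k leading zeros" from "exactly k leading zeros followed by a 1". The sketch below enumerates every 8-bit value (the width is an arbitrary choice for illustration) and measures both fractions. It is a small counting exercise under that assumption, not the full HyperLogLog estimator.

```python
from fractions import Fraction

BITS = 8  # illustrative hash width; real implementations use 32 or 64 bits


def leading_zeros(value: int, bits: int = BITS) -> int:
    """Number of leading zero bits when value is written with `bits` bits."""
    return bits - value.bit_length()


def fraction_of_values(predicate) -> Fraction:
    """Fraction of all BITS-bit values whose leading-zero count satisfies predicate."""
    hits = sum(1 for v in range(1 << BITS) if predicate(leading_zeros(v)))
    return Fraction(hits, 1 << BITS)


k = 3
at_least_k = fraction_of_values(lambda z: z >= k)  # pattern 000xxxxx
exactly_k = fraction_of_values(lambda z: z == k)   # pattern 0001xxxx

print(f"P(at least {k} leading zeros) = {at_least_k}")
print(f"P(exactly {k} leading zeros)  = {exactly_k}")
```

The pattern "k zeros, then a 1" fixes k+1 bits, so its probability is 1/2^(k+1); that is one common reading of where the extra 1 comes from. Storing k+1 also lets a register value of 0 mean "nothing seen yet".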
0
votes
1 answer

HyperLogLog Implementation on Redis Not Recognized

I'm trying to run some simple code that inserts a value into a key using the PFADD operation, but I get this error: ResponseError: unknown command 'PFADD' My code is as follows: import pandas as pd import redis r =…
Augmented Jacob
  • 1,567
  • 19
  • 45