Questions tagged [hyperloglog]

HyperLogLog is an approximate technique for estimating the number of distinct entries in a set.

HyperLogLog is implemented in Algebird, a Scala library for abstract algebra, and can be used in Summingbird to write MapReduce programs that estimate the cardinality of large datasets in streaming (online) or batch (offline) mode. Redis, a data structure store, also provides a HyperLogLog implementation.
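As a rough illustration of the technique described above, here is a minimal HyperLogLog sketch in Python. It is a toy under stated assumptions, not the Algebird or Redis implementation: the class name, the choice of SHA-1 as the hash, and the default of 2**12 registers are all made up for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
print(round(hll.count()))  # an estimate close to 10000, not an exact count
```

Adding the same items again leaves the registers unchanged, which is why duplicates do not inflate the estimate.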

89 questions

1 answer

Redis HyperLogLog - Too many errors

The scenario is really simple. I'm adding 50 elements (different each time) to an HLL. Usually on the third batch, I get a wrong PFCOUNT (151 instead of 150). I know that HLL has a low error rate, but is it so easy to get a false positive? can…

redis hyperloglog

asked Jan 26 '22 at 16:03

lordav
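For a rough sense of scale, the theoretical standard error of HyperLogLog with m registers is about 1.04 / sqrt(m). The sketch below compares that figure with the 151-versus-150 result from the question, assuming the 16384 registers that Redis uses; the helper name is hypothetical, and the comparison says nothing about how Redis handles small cardinalities internally.

```python
import math


def hll_standard_error(m):
    """Approximate standard error of a HyperLogLog estimate with m registers."""
    return 1.04 / math.sqrt(m)


m = 16384                          # register count used by Redis
observed = abs(151 - 150) / 150    # relative error reported in the question

print(f"standard error: {hll_standard_error(m):.4f}")  # 0.0081
print(f"observed error: {observed:.4f}")               # 0.0067
```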

1 answer

Do we have a way to check if an element already exists in HyperLogLog in Java

I have a use case where I need to check whether an element already exists in a HyperLogLog and, if not, make an HBase call. Is there any method in Java to check whether an element already exists in a HyperLogLog?

java hyperloglog

asked Nov 22 '21 at 14:09

Pranjal Tripathi

1 answer

Druid Default Distinct Approximation Algorithm

Is there a way to replace the default HLL approximation algorithm with ThetaSketch in Druid, so that count-distinct queries use ThetaSketch instead of HLL by default?

druid hyperloglog

asked Sep 03 '20 at 15:39

user2693313

1 answer

How to expire a HyperLogLog in Redis?

HyperLogLogs take up 12KB of space. I don't see anything in the docs about when that storage is freed up. My current plan is to call EXPIRE every time I call PFADD, but I can't find much discussion about expiring HLLs, so I'm wondering if I'm doing…

redis hyperloglog

asked Aug 21 '19 at 23:45

mgalgs

1 answer

Is usage analysis based on HyperLogLog compliant with GDPR?

Context: we have a telemetry system for our service and would like to track retention, how many users use various features, etc. There are two options to deal with user-identifiable information and be GDPR compliant: Support deleting user information…

hyperloglog

asked Jul 12 '19 at 05:41

ZakiMa

1 answer

Using HLL_COUNT.MERGE outside of SQL

I can use the following query to generate all the HLL sketches of the distinct counts: SELECT category, count(distinct city), HLL_COUNT.INIT(city) FROM `table` GROUP BY category And I get something like this: While I would normally use the…

algorithm google-bigquery hyperloglog

asked May 25 '19 at 01:30

David542

1 answer

Can't call `ApproximateDistinct.ApproximateDistinctFn` from ApacheBeam sql

I tried to use the aggregate function ApproximateDistinct.ApproximateDistinctFn from Apache Beam SQL, but it failed. My SQL: SELECT ApproximateDistinct(user_id) as distinct_count, profile, country_code, FROM PCOLLECTION GROUP BY…

google-cloud-dataflow apache-beam hyperloglog beam-sql

asked Apr 08 '19 at 10:18

Brachi

2 answers

How to save a HyperLogLog field to BigQuery from Apache Beam with the Dataflow runner

I need to save HLL sketches into BigQuery from Apache Beam. I found an extension library for Apache Beam that does it: But I can't find a way to save the sketch itself to BigQuery, so that I can use it later with the merge function and other functions…

java google-bigquery google-cloud-dataflow apache-beam hyperloglog

asked Apr 04 '19 at 15:27

Brachi

1 answer

Redis HyperLogLog key's cardinality is not increasing

When using a HyperLogLog key in a Redis 4.0.9 master-slave setup, we have a PFCOUNT of 52161862. Now when we add a unique item through PFADD, it returns zero and PFCOUNT is still 52161862. Any idea why we are not able to add more unique items to this key?

redis hyperloglog

asked Feb 25 '19 at 11:27

Neeraj Bhatt

1 answer

What is the difference between a probabilistic data structure and a sketch?

According to this StackOverflow question, probabilistic data structures are data structures that give approximate, as opposed to precise, answers. In particular, they have very low time and space complexities and are easily parallelizable, making…

data-structures approximation bloom-filter hyperloglog

asked Jul 09 '18 at 00:44

Shuklaswag

2 answers

Presto's support for approx_distinct

I am evaluating distributed query engines for analytical queries (both interactive as well as batch) on large scale data (~100GB). One of the requirements is low latency (<= 1s) for count-distinct queries, where approximate results (with up to 5%…

presto approximate hyperloglog

asked Aug 14 '17 at 12:36

Ameya

1 answer

Explanation on HyperLogLog algorithm

First of all, let me start off by saying that I read this question. While browsing the internet I came across this algorithm and wondered how it works. After reading about it I did understand how it counts the views by…

database algorithm hyperloglog

asked May 27 '17 at 21:23

nonerth

1 answer

Count unique visitors across any time range analytics?

We have a use case where we want to report on the unique visitors in our app across any time range (hour granularity). Example: Suppose at hour 0 we had the following visitors {A, B, C, D}, at hour 1 we have {C, D, E, F}, and at hour 2 we have {E, F, A,…

hadoop apache-spark hive bloom-filter hyperloglog

asked Apr 05 '17 at 02:30

Girish Subramanian
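One property that makes HyperLogLog a candidate for this kind of per-hour reporting is that sketches merge by taking the element-wise maximum of their registers, and the result equals the sketch of the union. The Python below is a hedged illustration using only the hour 0 and hour 1 sets from the question; the function names, the register count, and the SHA-1 hash are assumptions made for this example.

```python
import hashlib

P = 10  # 2**10 registers; an arbitrary size for this sketch


def sketch(items, p=P):
    """Build a HyperLogLog register array from an iterable of strings."""
    regs = [0] * (1 << p)
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - p)                    # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)       # remaining bits give the rank
        regs[idx] = max(regs[idx], (64 - p) - rest.bit_length() + 1)
    return regs


def merge(*sketches):
    """Merge sketches of equal size by taking the per-register maximum."""
    return [max(column) for column in zip(*sketches)]


hour0 = sketch({"A", "B", "C", "D"})
hour1 = sketch({"C", "D", "E", "F"})

# Merging hourly sketches gives the same registers as sketching the union,
# so visitors seen in both hours are not double counted.
merged = merge(hour0, hour1)
print(merged == sketch({"A", "B", "C", "D", "E", "F"}))  # True
```

Storing one sketch per hour and merging the ones that fall inside the requested range would then give an approximate unique count for that range.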

3 answers

Why is 1 added to the leading zero count in hyperloglog algorithm

If there are k number of leading zeros in the bit pattern of hash, why is the estimate size considered to be 2^(k+1)? shouldn't it be 2^k? the probability of having k leading zero should be 1/(2^k) and hence the size should be 2^k In my code I always get…

algorithm hyperloglog

asked Feb 13 '17 at 02:59

Golak Sarangi
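One way to look at the +1, assuming "k leading zeros" means exactly k zeros followed by a 1-bit: that pattern fixes k + 1 bits, so its probability is 2^-(k+1) rather than 2^-k. The small Python check below enumerates every 8-bit value to illustrate this; the 8-bit width and the helper name are arbitrary choices for the example.

```python
from collections import Counter

BITS = 8  # small width so every value can be enumerated


def leading_zeros(x, bits=BITS):
    """Number of zero bits before the first 1-bit in a fixed-width value."""
    return bits - x.bit_length()


counts = Counter(leading_zeros(x) for x in range(1 << BITS))

for k in range(BITS):
    # Exactly k leading zeros means k zeros followed by a 1-bit,
    # which fixes k + 1 bits of the pattern.
    probability = counts[k] / (1 << BITS)
    print(k, probability, probability == 2 ** -(k + 1))
```

Every row prints True, which is consistent with using 2^(k+1) as the single-observation estimate.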

1 answer

HyperLogLog Implementation on Redis Not Recognized

I'm trying to run a simple script that inserts a value into a key using the PFADD operation, but I get this error: ResponseError: unknown command 'PFADD' My code is as follows: import pandas as pd import redis r =…

python redis redis-py hyperloglog

asked Nov 22 '16 at 22:10

Augmented Jacob
