Questions tagged [hyperloglog]

HyperLogLog is an approximate technique for computing the number of distinct entries in a set.

HyperLogLog is implemented in Algebird, a Scala library for abstract algebra. That implementation can be used in Summingbird to create MapReduce programs for estimating the cardinalities of large datasets in streaming (online) or batch (offline) mode. Redis, a data structure store, also has a HyperLogLog implementation.
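For readers new to the technique, the idea can be sketched in a few lines of Python. This is a toy illustration, not how Algebird or Redis implement it: the class name TinyHLL, the 64-bit hash taken from SHA-1, and the default of 2^14 registers are all assumptions made for the example.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash derived from SHA-1; the hash choice is an assumption of this sketch.
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # rank = 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw


hll = TinyHLL()
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.count()))  # an estimate near 100000, not an exact count
```

Adding the same items again leaves the registers unchanged, which is why the structure counts distinct entries rather than total insertions.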

89 questions
1
vote
1 answer

Which hash function does HyperLogLog use?

I have read in a few articles that HyperLogLog and LogLog use a hash function and that it is solely responsible for the prediction value. If we assign a value to a certain username to predict the number of times the individual has visited a page,…
1
vote
1 answer

How to understand the 0.81% standard error of Redis HyperLogLog

I am confused by the HyperLogLog standard error of 0.81%, so I changed rand() to $n+$j in https://github.com/redis/redis/blob/unstable/tests/unit/hyperloglog.tcl#L48 and changed 5% to 0.81% in…
ming
  • 101
  • 4
1
vote
0 answers

Error when trying to process HyperLogLog created on Snowflake, in Trino

In Trino, I'm getting the error message "Cannot deserialize HyperLogLog". I have a query on Snowflake that does the following: select __TENANT_ID hll_accumulate(VISITOR_ID) as visitor_hll from [table] where …
1
vote
1 answer

Redis - Count distinct problem (without HyperLogLog)

I need to solve a count-distinct problem in Redis without using HyperLogLog (because of its known 0.81% error). I receive different requests, each with a list of objects [O1, O2, ... On] for a specific key A. For each list of objects received, Redis…
lordav
  • 105
  • 1
  • 2
  • 10
1
vote
0 answers

How do we use BigQuery HLL (HyperLogLog) functions in Looker?

I have a quick question on how we can use the BigQuery HLL functions in Looker. For example, there is a BigQuery table with the following structure: [image: Sample BigQuery Table]. In Looker, do I need to define this field respondents_hll as a dimension or…
iPrithvi
  • 11
  • 1
1
vote
1 answer

How do I increase the accuracy of Redis HyperLogLog?

I am using a very simple implementation of Redis HLL: PFADD to add the elements and PFCOUNT (something with PFMERGE) to get the count. Is there a way I can tune the efficiency of Redis HLL, for example by increasing the memory allocated?
Ram
  • 1,155
  • 13
  • 34
1
vote
1 answer

Using HyperLogLog functions in BigQuery can you get different results from the same query on the same data?

My query looks like: SELECT HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(key.item) FROM UNNEST(data.list) key)), FROM dataset. Let's say I run this query 10000 times (on the same set of data): will I get 10000 identical results or a small percentage…
Ire00
  • 79
  • 6
1
vote
1 answer

Django cumulative sum of HyperLogLog (HLL) Postgres field

I'm using the HyperLogLog (hll) field to represent unique users, using the Django django-pg-hll package. What I'd like to do is get a cumulative total of unique users over a specific time period, but I'm having trouble doing this. Given a model…
Darkstarone
  • 4,590
  • 8
  • 37
  • 74
1
vote
1 answer

BigQuery to Data Studio : Show reliable COUNT DISTINCT regardless of the selected period

In my BigQuery project I store event data integrated from Firebase. The granularity and dimensions are such that trying to present raw data in Data Studio quickly makes the report become VERY slow (1-2 min per page/interaction). I then started to…
1
vote
1 answer

Distinct Count algorithm

I am wondering if it is possible to do an approximate distinct count in the following way: I have an aggregation like this: +---------+----------------------+-------------------------------+ | country | unique products sold | helper_data --…
David542
  • 104,438
  • 178
  • 489
  • 842
1
vote
1 answer

URL filtering on top of Redis: Bloom filters or HyperLogLog data structure

I want to implement URL filtering for a distributed crawling system on top of a Redis database (e.g. don't visit the same URL twice, so I need to somehow keep track of all of them with a minimal memory footprint; there is no need to store full…
d-d
  • 1,775
  • 3
  • 20
  • 29
1
vote
0 answers

How does hashing a stream of values guarantee randomness in HyperLogLog?

From this Stack Overflow post: The main trick behind this algorithm is that if you, observing a stream of random integers, see an integer which binary representation starts with some known prefix, there is a higher chance that the cardinality of the…
1
vote
1 answer

Execute extract on Tableau for distinct count using HLL

I have a somewhat huge table (130 million rows), that I am able to crunch on the same server in under 10 minutes, and produce a slimmed-down, pre-aggregated table, that works just fine and everyone is happy to use it. The table is grouped by around…
Alex
  • 14,338
  • 5
  • 41
  • 59
1
vote
2 answers

Is there any effective way to reduce error in HyperLogLog (Redis)?

In Redis, we treat HyperLogLog as a set for counting distinct elements. As everyone knows, for each key, HLL consumes only 12 KB of memory and produces approximations with a standard error of 0.81%. Since I have so many elements to count, I want to lower…
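The 12 KB and 0.81% figures quoted in this question follow from the usual HyperLogLog relationship between register count and error, roughly 1.04 / sqrt(m). A small sketch of the arithmetic, assuming the 16384 six-bit registers used by Redis's dense representation (the helper names here are hypothetical):

```python
import math


def hll_standard_error(registers):
    """Approximate relative standard error of HyperLogLog with m registers."""
    return 1.04 / math.sqrt(registers)


def hll_dense_bytes(registers, bits_per_register=6):
    """Memory needed to store the registers in a packed (dense) layout."""
    return registers * bits_per_register // 8


m = 2 ** 14  # 16384 registers
print(f"{hll_standard_error(m):.4%}")  # 0.8125%
print(hll_dense_bytes(m))              # 12288 bytes, i.e. 12 KB
```

Because the error shrinks with the square root of the register count, halving it would take four times as many registers, and therefore four times the memory.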
1
vote
1 answer

How to get unique user count for custom Firebase event with multiple dimensions applied?

I'm currently trying to count unique users for my custom Firebase events in BigQuery. While I've been able to get the aggregate figures by using the APPROX_COUNT_DISTINCT function, I'm still stuck getting the correct (unique) count when…
Peter P
  • 491
  • 4
  • 14