Questions tagged [hyperloglog]

HyperLogLog is a probabilistic technique for estimating the number of distinct entries in a set.

HyperLogLog is implemented in Algebird, a Scala library for abstract algebra. It can be used in Summingbird to create MapReduce programs that estimate the cardinality of large datasets in streaming (online) or batch (offline) mode. The data structure store Redis also has a HyperLogLog implementation.
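
As a rough illustration of the technique the tag describes, here is a minimal Python sketch. It is not the Algebird or Redis implementation. The class name `HyperLogLog`, the default precision `p=12`, and the use of SHA-1 as the hash are assumptions made for this example; the bias constant is the usual approximation for 128 or more registers, so `p` should be at least 7.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical class, not a library API)."""

    def __init__(self, p=12):
        self.p = p                    # number of index bits (assumed >= 7)
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value (an assumption;
        # any well-mixed 64-bit hash would do).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias correction for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog()
for i in range(100000):
    hll.add(f"user-{i}")
print(round(hll.count()))  # an estimate close to 100000, not an exact count
```

With 4096 registers the expected relative error is about 1.04 / sqrt(4096), roughly 1.6%. Adding the same item again never changes the registers, which is why duplicates do not inflate the estimate.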

89 questions
3
votes
1 answer

Speeding up my implementation of the HyperLogLog algorithm

I made my own implementation of the HyperLogLog algorithm. It works well, but sometimes I have to fetch a lot (around 10k-100k) of HLL structures and merge them. I store each of them as a bit string, so first I have to convert each bit string to buckets.…
skaurus
  • 1,581
  • 17
  • 27
3
votes
1 answer

Redis Hyperloglog - PFCOUNT side effect

Redis recently released a new data structure called HyperLogLog. It allows us to keep a count of unique objects and takes up only 12k bytes. What I don't understand is that Redis's PFCOUNT command is said to be technically a write…
JHAWN
  • 399
  • 4
  • 18
2
votes
2 answers

Count unique users in last 60 mins per page with Redis HyperLogLog

I’m designing an algorithm to count unique users on a set of pages, based on a 60-minute sliding scale. It needs to find unique IPs (or tokens) that have hit a particular page and total up those hits within the last 60 minutes. I need this to be very fast…
Ben
  • 155
  • 2
  • 12
2
votes
1 answer

HLL+ Precision for Google BigQuery

The precision of using HLL.INIT(...) and HLL.MERGE(...) is described here: https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions However, I'm wondering whether there is a cardinality below which HLL is guaranteed to…
David542
  • 104,438
  • 178
  • 489
  • 842
2
votes
2 answers

What is HyperLogLog and what is it good for?

I was studying the data structures supported by Redis and could not find an explanation that made me understand what HyperLogLog is. How do I use it, and what is it good for?
Vikto
  • 512
  • 1
  • 7
  • 19
2
votes
1 answer

redis HLL too many false positives

HyperLogLog is a probabilistic algorithm. According to the Redis HLL documentation, we should get an error of 0.81%, but I get errors of 17-20%. I think there is something wrong. This is my simple Perl test script. Is there some error? #!/usr/bin/perl…
Ram
  • 1,155
  • 13
  • 34
2
votes
0 answers

Probabilistic algorithm for set cardinality that supports deleting from the set

Is there any probabilistic algorithm for calculating set cardinality that also supports deleting elements from the set? I've been using HyperLogLogs for calculating cardinalities of some sets and their unions, but when necessity of…
user7014602
2
votes
1 answer

Redis Hyperloglog limitations

I am trying to solve a problem in a hacky way using Redis HyperLogLog, but what I am trying to understand is the limitations of HyperLogLog and the assumptions it makes about the data or its distribution. The count-min and Bloom filter have their own set of…
Chenna V
  • 10,185
  • 11
  • 77
  • 104
2
votes
1 answer

How to improve performance of PIG job that uses Datafu's Hyperloglog for estimating cardinality?

I am using Datafu's Hyperloglog UDF to estimate a count of unique ids in my dataset. In this case I have 320 million unique ids that may appear multiple times in my dataset. Dataset: Country, ID. Here is my code: REGISTER…
mnadig
  • 81
  • 2
  • 7
2
votes
1 answer

HyperLogLog intersection: why not use min?

When doing a union between two compatible HyperLogLog objects, you can just take the maximum bucket to do a lossless union that doesn't introduce any new error: Union.Bucket[i] = Max(A.Bucket[i], B.Bucket[i]) When doing an intersection though, you…
Alan Wolfe
  • 615
  • 4
  • 16
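
The bucket-wise maximum described in the question above can be sketched in Python. The helper names `registers_for` and `hll_union`, the precision `p=10`, and the SHA-1 hash are assumptions made for this illustration, not part of any particular library. The example only checks the union property: merging two register arrays gives exactly the registers that would result from sketching the combined input.

```python
import hashlib


def registers_for(items, p=10):
    """Build a plain list of HLL registers for the items (hypothetical helper)."""
    m = 1 << p
    regs = [0] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        idx = h >> (64 - p)                        # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)           # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1    # leftmost 1-bit position
        regs[idx] = max(regs[idx], rank)
    return regs


def hll_union(a, b):
    """Union of two compatible register arrays: the per-bucket maximum."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of buckets")
    return [max(x, y) for x, y in zip(a, b)]


a = registers_for(range(0, 6000))
b = registers_for(range(4000, 10000))
merged = hll_union(a, b)

# The merged registers match a sketch built directly from the combined data.
print(merged == registers_for(range(0, 10000)))
```

Because the per-bucket maximum is associative and commutative, many sketches can be folded together in any order, for example with `functools.reduce(hll_union, sketches)`.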
2
votes
1 answer

Why is data.fu implementing HyperLogLog as an accumulator and not as algebraic?

data.fu has a nice implementation of HyperLogLog for estimating cardinality here. However, it's implemented as an Accumulator, which means it will run only at the reducer and not in the combiner (but it will never load the entire set into memory as in…
ihadanny
  • 4,377
  • 7
  • 45
  • 76
2
votes
1 answer

HyperLogLog correctness on mapreduce

Something that has been bugging me about the HyperLogLog algorithm is its reliance on the hash of the keys. The issue I have is that the paper seems to assume that we have a totally random distribution of data on each partition; however, in the…
aaronman
  • 18,343
  • 7
  • 63
  • 78
2
votes
1 answer

How to apply hyperloglog to a timeseries stream

Can someone explain or link to an explanation about how counting the cardinality of a set with HLL can be used for time series analysis? I'm pretty sure druid.io does exactly this, but I'm looking for a general explanation of how to do this with HLL…
Emmanuel Oga
  • 354
  • 2
  • 12
1
vote
0 answers

Writing HyperLogLog Sketches from Apache Spark To Trino

I'm attempting to generate aggregate HLL sketches in a Scala Spark job and push the data to a varbinary in Trino for dashboard aggregations. I'm using the spark-alchemy library to generate the sketches in Spark, but continue to run into…
J.Fratzke
  • 1,415
  • 15
  • 23
1
vote
2 answers

PostgreSQL - HyperLogLog extension not found

Can someone explain more clearly (well, in a way a dummy could understand), or more correctly, how to install the HyperLogLog hll extension for PostgreSQL on my M1 Mac? When running CREATE EXTENSION hll; I get: Query 1 ERROR: ERROR: could…
liliget
  • 291
  • 4
  • 12