Why does HyperLogLog work, and which real-world problems is it used for?

Question

I know how HyperLogLog works, but I want to understand in which real-world situations it really applies, i.e. where it makes sense to use HyperLogLog, and why. If you've used it to solve a real-world problem, please share. What I am looking for is this: given HyperLogLog's standard error, in which real-world applications is it really used today, and why does it work there?

score 1 · Answer 1 · answered Dec 18 '15 at 00:42

("Applications for cardinality estimation" is probably too broad. I would have added this simply as a comment, but it won't fit.)

I would suggest you turn to the extensive academic research on the subject; academic papers usually include some discussion of prior research as well as the applications the technique has been used for. You could start by following the references of interest cited in this article:

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, by P. Flajolet et al.

... This problem has received a great deal of attention over the past two decades, finding an ever growing number of applications in networking and traffic monitoring, such as the detection of worm propagation, of network attacks (e.g., by Denial of Service), and of link-based spam on the web [3]. For instance, a data stream over a network consists of a sequence of packets, each packet having a header, which contains a pair (source–destination) of addresses, followed by a body of specific data; the number of distinct header pairs (the cardinality of the multiset) in various time slices is an important indication for detecting attacks and monitoring traffic, as it records the number of distinct active flows. Indeed, worms and viruses typically propagate by opening a large number of different connections, and though they may well pass unnoticed amongst a huge traffic, their activity becomes exposed once cardinalities are measured (see the lucid exposition by Estan and Varghese in [11]). Other applications of cardinality estimators include data mining of massive data sets of sorts—natural language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality estimators.

Well, I am not really asking for "Applications for cardinality estimation". Given HyperLogLog's standard error, in which real-world applications is it really used today? – Chenna V, Dec 18 '15 at 00:52

score 0 · Answer 2 · answered Dec 24 '15 at 10:41

At my work, HyperLogLog is used to estimate the number of unique users or unique devices hitting different code paths in online services. For example, how many users are affected by each type of service error? How many users use each feature? There are MANY interesting questions HyperLogLog allows us to answer.
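To make the "unique users per error type" idea concrete, here is a minimal sketch of a HyperLogLog in Python, with one sketch kept per error type. It is not the answerer's actual system: the class, the error-type names, the user IDs, and the choice of SHA-1 and p=12 (4096 registers) are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog with 2**p registers (illustrative, not production code)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                          # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper (valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item (truncated SHA-1; any well-mixed hash would do).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - self.p)                 # first p bits select a register
        rest = x & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Raw estimate: normalized harmonic mean of 2**register.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        # Small-range correction (linear counting) while registers are still empty.
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


# Hypothetical event stream of (error_type, user_id) pairs.
events = [("timeout", f"user{i}") for i in range(50_000)]
events += [("timeout", f"user{i}") for i in range(10_000)]   # repeat visitors
events += [("auth_failed", f"user{i}") for i in range(2_000)]

users_by_error = {}
for error_type, user_id in events:
    users_by_error.setdefault(error_type, HyperLogLog()).add(user_id)

for error_type, sketch in users_by_error.items():
    print(error_type, sketch.count())
```

Each sketch stays at 4096 small registers regardless of how many users it has seen, and repeated users do not change the estimate. With 4096 registers the expected relative error is roughly 1.04/sqrt(4096), about 1.6%, which is usually acceptable for questions like "how many users hit this error?".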

OronNavon

Cool, thanks. Actually, you just gave me a better understanding of how I can answer some questions from the data we have. May I ask one question: does it give reasonable, usable answers for your questions? – Chenna V Dec 24 '15 at 15:39
Yep, we use them all the time. – OronNavon May 04 '16 at 20:48

score 0 · Answer 3 · answered Aug 08 '22 at 01:38

Stack Overflow might use HyperLogLog to count the views of each question. Stack Overflow wants each user to contribute at most one view per question, so every counted view is unique.

This could be implemented with a set: every question would have a set that stores the usernames of its viewers:

  question#ID121e={username1,username2...}

Creating a set for each question takes up space, and consider how many questions have been asked on this platform: the total space needed to keep track of every view per user would be huge. HyperLogLog, by contrast, uses about 12 kB of memory per key no matter how many usernames are added, even for 10 million views.
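The 12 kB figure can be checked with a little arithmetic. The sketch below assumes 2^14 registers of 6 bits each (the configuration that, to my knowledge, Redis documents for its HyperLogLog keys) and uses the standard error of about 1.04/sqrt(m) from the HyperLogLog paper. The 10-bytes-per-username figure is a made-up assumption for comparison and ignores set overhead.

```python
import math


def hll_standard_error(num_registers):
    """Relative standard error of HyperLogLog: about 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(num_registers)


def hll_memory_bytes(num_registers, bits_per_register=6):
    """Size of a densely packed register array, in bytes."""
    return num_registers * bits_per_register // 8


m = 2 ** 14                                  # assumed register count
print(hll_memory_bytes(m))                   # 12288 bytes, i.e. 12 KiB
print(f"{hll_standard_error(m):.2%}")        # 0.81%

# Exact alternative: a set holding every username for one question.
views = 10_000_000                           # "10 million views" from the answer
avg_username_bytes = 10                      # assumption; real overhead is higher
exact_set_bytes = views * avg_username_bytes
print(round(exact_set_bytes / hll_memory_bytes(m)))  # how many times larger the set is
```

Under these assumptions the exact set is thousands of times larger than the sketch for a single popular question, and the price paid is a view count that is typically off by less than one percent.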
