
I have come across several algorithms, such as the Flajolet-Martin algorithm and HyperLogLog, for finding the unique elements in a list of elements, and became curious about how Java does it. What is the time complexity of storing and finding unique values in each of these cases?

Phenomenal One
  • java.util.Set is an interface, not an implementation. There are two commonly used implementations in the JDK libraries: java.util.TreeSet and java.util.HashSet. Neither of them uses HyperLogLogs. – Erwin Bolwidt Oct 22 '17 at 01:58
  • I'm also not sure how that would be applicable to the java.util.Set interface, as the API requires all elements to be kept, and it requires uniqueness. The HyperLogLog algorithm estimates the cardinality of a **multi**set (a bag) when there are too many elements in it to be kept in memory at the same time. – Erwin Bolwidt Oct 22 '17 at 02:00
  • It's in the name. HashSet uses a hash table. TreeSet uses a tree. – Boann Oct 22 '17 at 02:12
  • Neither the Flajolet-Martin algorithm nor HyperLogLog would be suitable for a Map data structure. They are about **counting** distinct elements in a stream. – Stephen C Oct 22 '17 at 02:20
  • Why not have a look for yourself? The JDK is open source. The sources are [here](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util) – Kevin Anderson Oct 22 '17 at 03:32

2 Answers


The Flajolet-Martin and HyperLogLog algorithms compute an approximate count of the distinct elements (the count-distinct problem) in one pass over a stream of N elements, using O(N) time and a modest amount of memory (much better than O(N)).

An implementation of the Map API does not need a solution to the "count-distinct" problem.

(Aside: TreeMap and HashMap already keep a precomputed count of the number of entries in the map [1]; this is what Map.size() returns. Provided that you don't get into thread-safety problems, the result is exact, not approximate. The cost of calling size() is O(1). The cost of maintaining the count is O(1) per update, so O(U) in total, where U is the number of map addition and removal operations performed.)

More generally, neither the Flajolet-Martin algorithm nor HyperLogLog can form the basis for a Map data structure. They do not address the dictionary problem: they do not store the elements, so nothing can be looked up afterwards.

The algorithms used by HashMap and TreeMap are, respectively, hash table and binary search tree algorithms (TreeMap is documented as a red-black tree). There are more details in the respective javadocs, and the full source code (with comments) is readily available for you to look at. (Google for "java.util.HashMap" source ... for example.)
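To make the difference between the two map types concrete, here is a minimal sketch. The class name `MapDemo` and the helper `countWords` are made up for illustration; only the `java.util` calls are real API. It shows that `size()` is an exact distinct-key count in both cases, and that TreeMap additionally keeps its keys in sorted order.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class MapDemo {
    /** Hypothetical helper: tallies each word into the given map and returns it. */
    static Map<String, Integer> countWords(Map<String, Integer> target, String... words) {
        for (String w : words) {
            target.merge(w, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return target;
    }

    public static void main(String[] args) {
        String[] words = {"pear", "apple", "pear", "fig", "apple", "pear"};

        Map<String, Integer> hashed = countWords(new HashMap<>(), words);
        Map<String, Integer> sorted = countWords(new TreeMap<>(), words);

        // size() is the exact number of distinct keys, not an estimate.
        System.out.println(hashed.size()); // prints 3
        // TreeMap iterates its keys in sorted order; HashMap makes no ordering promise.
        System.out.println(sorted);        // prints {apple=2, fig=1, pear=3}
    }
}
```

Both maps hold every key, so their memory use grows with the number of distinct keys. That is the price of exact answers.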


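1 - Interestingly, ConcurrentHashMap doesn't work this way ... because updating a single size field would be a concurrency bottleneck. Instead, size() computes the total on demand by summing counts that are kept in separate places to reduce contention (per-segment counts in Java 7 and earlier, counter cells in Java 8 and later), so it is not a simple field read, and the result is only reliable when the map is not being updated concurrently.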

Stephen C

The HashSet type tracks its elements using a hash table (usually with closed addressing, i.e. separate chaining), and the TreeSet type tracks its elements using a binary search tree. These data structures give exact answers to the question "is this element here?" and are useful when you need to know with 100% certainty whether you've seen something before. Their memory usage is typically directly proportional to the number of distinct elements seen so far.
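As a small illustration of those exact answers, here is a sketch using both set types. The class name `SetDemo` and its two helper methods are invented for this example. Per the javadocs, HashSet offers constant-time `add` and `contains` assuming the hash function disperses elements properly, while TreeSet guarantees log(n) time for the same operations.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class SetDemo {
    /** Hypothetical helper: distinct elements via a hash table. */
    static Set<Integer> distinctHashed(List<Integer> xs) {
        return new HashSet<>(xs);
    }

    /** Hypothetical helper: distinct elements via a balanced search tree (sorted). */
    static Set<Integer> distinctSorted(List<Integer> xs) {
        return new TreeSet<>(xs);
    }

    public static void main(String[] args) {
        List<Integer> seen = Arrays.asList(5, 3, 5, 1, 3);

        Set<Integer> hashed = distinctHashed(seen);
        Set<Integer> sorted = distinctSorted(seen);

        System.out.println(hashed.size());      // prints 3 (exact distinct count)
        System.out.println(hashed.contains(3)); // prints true
        System.out.println(hashed.contains(4)); // prints false
        System.out.println(sorted);             // prints [1, 3, 5]
    }
}
```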

On the other hand, cardinality estimators like HyperLogLog are good for answering questions of the form "how many distinct elements are there, give or take a few percent?" They're great when you need a rough estimate of how many distinct things you've seen and when putting everything in a hash table or a binary search tree would take way too much memory (for example, if you're a Google web server and you want to count distinct IP addresses visiting you), since the amount of memory they use is typically something you get to pick up front. However, they can't answer questions of the form "have I seen this exact thing before?" and so wouldn't work as implementations of any of the java.util.Set subtypes.
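For contrast, here is a rough sketch of a HyperLogLog-style estimator. It is a simplified illustration and not a production implementation: the class name `DistinctEstimator`, the choice of 1024 registers, and the use of the SplitMix64 mixing function as the hash are all assumptions made for this example. Note that it stores only a fixed array of small counters, so it has no way to tell you whether a particular value was seen.

```java
/** Simplified HyperLogLog-style estimator (illustrative sketch, not production code). */
class DistinctEstimator {
    private static final int P = 10;      // assumed precision: 2^10 registers
    private static final int M = 1 << P;  // 1024 registers, roughly 3% standard error
    private final byte[] registers = new byte[M]; // fixed memory, whatever the stream length

    /** 64-bit mixer (SplitMix64 step) standing in for a good hash function. */
    private static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    void add(long value) {
        long h = mix(value);
        int index = (int) (h >>> (64 - P)); // top P bits pick a register
        long rest = h << P;                 // remaining bits, left-aligned
        // Rank = position of the first 1-bit in the remaining bits (capped if all zero).
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - P) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank; // keep only the maximum rank seen
        }
    }

    double estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / M); // bias correction constant for M >= 128
        double raw = alpha * M * M / sum;
        if (raw <= 2.5 * M && zeros > 0) {
            // Small-range correction: fall back to linear counting.
            return M * Math.log((double) M / zeros);
        }
        return raw;
    }

    public static void main(String[] args) {
        DistinctEstimator est = new DistinctEstimator();
        for (int pass = 0; pass < 3; pass++) {   // every value is fed three times
            for (long i = 0; i < 100_000; i++) {
                est.add(i);
            }
        }
        // Duplicates do not move the registers, so this approximates 100,000.
        System.out.println(est.estimate());
    }
}
```

The whole state is 1024 bytes regardless of how many elements pass through, which is the trade being made: bounded memory in exchange for an approximate count and no membership queries.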

In short, the data structures here are designed to solve different problems. The traditional BST and hash table are there for exact queries where knowing for certain whether you've seen something is the goal and you want to be able to, say, iterate over all the elements seen. Cardinality estimators are good where you just care about how many total distinct elements there are, you don't care what they are, and you don't need exact answers.

templatetypedef