We need to report the number of unique visitors to our app over any time range, at hourly granularity.

Example: Suppose the visitors at hour 0 are {A, B, C, D}, at hour 1 {C, D, E, F}, at hour 2 {E, F, A, B}, and at hour 3 {A, C}. We need to answer how many unique visitors there were between hour 1 and hour 3, and also how many there were between hour 0 and hour 3, and so on for any other range.

Of course, we cannot save all the unique visitor IDs, but we can save a Bloom filter for each hour.

I was planning to use the inclusion-exclusion principle to calculate the unions, but I would like to know whether there are any frameworks for this, or whether someone has a better solution.
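For reference, here is a minimal sketch of the inclusion-exclusion idea on exact sets, using the example data above. It is only an illustration: `union_size` is a hypothetical helper name, the ranges are assumed to be inclusive, and with Bloom filters the intersection sizes would have to be estimated rather than computed exactly.

```python
from itertools import combinations

def union_size(sets):
    """Size of the union via inclusion-exclusion over all non-empty subsets."""
    total = 0
    for k in range(1, len(sets) + 1):
        # Odd-sized groups are added, even-sized groups are subtracted.
        sign = 1 if k % 2 == 1 else -1
        for group in combinations(sets, k):
            total += sign * len(set.intersection(*group))
    return total

# Visitors per hour, taken from the example in the question.
hours = {
    0: {"A", "B", "C", "D"},
    1: {"C", "D", "E", "F"},
    2: {"E", "F", "A", "B"},
    3: {"A", "C"},
}

print(union_size([hours[h] for h in range(1, 4)]))  # hours 1..3 -> 6
print(union_size([hours[h] for h in range(0, 4)]))  # hours 0..3 -> 6
```

Note that a range of n hours needs 2^n - 1 intersection terms, so this approach becomes expensive quickly as the range grows.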

Big data technologies: we have HDFS set up, with Hive, Spark, and Kafka.

  • In my current solution, at each hour I plan to calculate the new visitors compared to the previous hours. For example, when processing data for hour 5, I plan to calculate: (1) the unique visitors in hour 5; (2) the new visitors in hour 5 that were not in hour 4; (3) the new visitors in hour 5 that were not in hour 3 or hour 4; and so on. – Girish Subramanian Apr 05 '17 at 02:32
  • You should look at Spark Streaming; it has many built-in [transformations and window operations](http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams) that your use case needs. – Ronak Patel Apr 05 '17 at 12:39

1 Answer

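You could use the HyperLogLog algorithm. HyperLogLog sketches are very space-efficient and can easily be merged to construct unions: the merged sketch is the register-wise maximum of the input sketches. If you store one sketch per hour, you can answer any range query by merging the sketches for the hours in that range and reading the estimate from the result. See http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf.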
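As a rough illustration of the merge idea, here is a minimal, unoptimized HyperLogLog sketch in plain Python. It is a toy under stated assumptions, not a production implementation: the class name `HyperLogLog`, the precision `p=12`, the use of SHA-1 as the hash, and the helper `unique_visitors` are all choices made for this example, and only the small-range (linear counting) correction is included.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (assumes p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: the first p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        width = 64 - self.p
        idx = h >> width
        rest = h & ((1 << width) - 1)
        rank = width - rest.bit_length() + 1  # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # The sketch of a union is the register-wise maximum of the inputs.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = m * math.log(m / zeros)
        return round(estimate)

def unique_visitors(sketches, start_hour, end_hour):
    """Approximate unique count over the inclusive range [start_hour, end_hour]."""
    merged = HyperLogLog(sketches[start_hour].p)
    for hour in range(start_hour, end_hour + 1):
        merged = merged.merge(sketches[hour])
    return merged.count()

# One sketch per hour, built from the example data in the question.
hours = {0: "ABCD", 1: "CDEF", 2: "EFAB", 3: "AC"}
sketches = {}
for hour, visitors in hours.items():
    sketches[hour] = HyperLogLog()
    for visitor in visitors:
        sketches[hour].add(visitor)

print(unique_visitors(sketches, 1, 3))  # approximate; the exact answer is 6
```

Each hourly sketch has a fixed size regardless of how many visitors it has seen, and a range query only costs one merge per hour in the range. The results are estimates, so this fits when a small relative error is acceptable.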

otmar