We have a use case where we want to report the number of unique visitors to our app across any time range, at hourly granularity.
Example: Suppose at hour 0 we had the visitors {A, B, C, D}, at hour 1 we had {C, D, E, F}, at hour 2 we had {E, F, A, B}, and at hour 3 we had {A, C}. We need to answer how many unique visitors there were between hour 1 and hour 3, and at the same time be able to answer how many there were between hour 0 and hour 3, and so on for any other range.
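To pin down the expected answers, here is the same example computed exactly with plain Python sets. This is only an illustration of the query we want to support, not something that scales; the names `visitors_by_hour` and `unique_visitors` are made up for this sketch.

```python
# Toy data from the example above: hour -> set of visitor IDs.
visitors_by_hour = {
    0: {"A", "B", "C", "D"},
    1: {"C", "D", "E", "F"},
    2: {"E", "F", "A", "B"},
    3: {"A", "C"},
}


def unique_visitors(start_hour, end_hour):
    """Count distinct visitors seen in the inclusive range [start_hour, end_hour]."""
    seen = set()
    for hour in range(start_hour, end_hour + 1):
        # Hours with no recorded traffic contribute nothing.
        seen |= visitors_by_hour.get(hour, set())
    return len(seen)


print(unique_visitors(1, 3))  # 6 (A, B, C, D, E, F)
print(unique_visitors(0, 3))  # 6
```

Note that the per-hour counts (4 + 4 + 2 = 10 for hours 1 to 3) cannot simply be added, because the same visitor shows up in several hours.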
Of course, we cannot save all the unique visitor IDs, but we can save a Bloom filter for each hour.
I was planning to use the inclusion-exclusion principle to calculate the size of the unions, but I would like to know whether there are any frameworks for this, or whether someone has a better solution.
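For concreteness, here is a rough sketch of the per-hour Bloom filter idea in plain Python. It assumes every hourly filter uses the same bit length and the same hash functions; under that assumption a bitwise OR of the hourly filters is exactly the filter of the combined visitor set, so the count for a range can be estimated from the number of set bits X as n ≈ -(m/k) ln(1 - X/m), without inclusion-exclusion over per-hour counts. The sizes `M` and `K` and the helper names are invented for illustration and do not come from any framework.

```python
import hashlib
import math

M = 1 << 16  # bits per filter (assumed value; must be identical for every hour)
K = 4        # bit positions set per visitor ID (assumed value)


def _positions(item):
    """Derive K bit positions for an item by double hashing one SHA-256 digest."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, so the K positions differ
    return [(h1 + i * h2) % M for i in range(K)]


def build_filter(visitor_ids):
    """Build a Bloom filter, stored as a Python int used as a bitset."""
    bits = 0
    for visitor_id in visitor_ids:
        for pos in _positions(visitor_id):
            bits |= 1 << pos
    return bits


def union_filters(filters):
    """OR the filters together; valid only if all share the same M, K and hashes."""
    bits = 0
    for f in filters:
        bits |= f
    return bits


def estimate_count(bits):
    """Estimate how many distinct items were inserted, from the set-bit count."""
    set_bits = bin(bits).count("1")
    if set_bits >= M:
        return float("inf")  # filter is saturated; no meaningful estimate
    return -(M / K) * math.log(1 - set_bits / M)


# One filter per hour, using the toy data from the example.
hourly = [
    build_filter({"A", "B", "C", "D"}),
    build_filter({"C", "D", "E", "F"}),
    build_filter({"E", "F", "A", "B"}),
    build_filter({"A", "C"}),
]
print(round(estimate_count(union_filters(hourly[1:4]))))  # hours 1 to 3
```

The estimate is approximate and degrades as the filter fills up, so `M` would have to be sized for the busiest range we want to query, not just for a single hour.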
Big data technologies available: we have an HDFS setup with Hive, and also Spark and Kafka.