I'd like to get some insight into how various companies solve counting/incrementing the number of "likes"/"views"/"retweets" or something similar at scale.

At userbases past 50 million monthly active users, I've seen both Redis and Cassandra used to store sets of userIds so that set cardinality (the count of viewers, for example) can be retrieved quickly. These solutions have some warts but work well, and they can be, and are being, scaled out. However, I'm curious what other shops use in this case.

Specifically, do the solutions:

  • Use sets, or other data structures, or just plain key-value?
  • Exact or approximate counts?
  • In-memory only, or hybrid?
  • Open source solution, or home grown?
  • Has anybody built a lightweight set-only storage system with hyperloglog estimation on top of it?
Don Work
nflacco

1 Answer

Use sets, or other data structures, or just plain key-value?

HyperLogLog is a powerful algorithm that estimates the number of unique users/views using very little storage, at the cost of a small approximation error.
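
To make the idea concrete, here is a minimal pure-Python sketch of a HyperLogLog counter. It is only an illustration, not what Redis or any other product actually runs: the class name, the choice of SHA-1 as the hash, and the default of 2^12 registers are all assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog: estimates the number of distinct items added."""

    def __init__(self, p=12):
        # p index bits -> m = 2**p registers (assumed default: 4096).
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper (valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item (SHA-1 is an arbitrary choice for this sketch).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The first p bits select a register.
        index = h >> (64 - self.p)
        # The remaining bits give the rank: position of the leftmost 1-bit.
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        # Each register only remembers the largest rank it has seen.
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        # Harmonic-mean based raw estimate.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        # Small-range correction (linear counting) while many registers are empty.
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate))
```

Usage would look like `hll = HyperLogLog(); hll.add("user:42"); hll.count()`. Adding the same userId twice does not change the registers, which is why repeated views by one user are not double counted.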

Exact or approximate counts?

At this scale, an exact count is useless and not meaningful. After all, when you have 50 million users, knowing that an item has 1.34 million unique visitors with a 2% error margin is good enough.
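
As a rough guide to what a given error margin costs, the HyperLogLog paper gives a relative standard error of about 1.04/sqrt(m) for m registers. The sketch below reads the "2% error margin" above as that standard error, which is an assumption; the function names are made up for the example.

```python
import math


def hll_relative_error(p):
    """Approximate relative standard error of HyperLogLog with 2**p registers."""
    m = 1 << p
    return 1.04 / math.sqrt(m)


def min_precision_for(target_error):
    """Smallest p (number of index bits) whose standard error is within target_error."""
    p = 4
    while hll_relative_error(p) > target_error:
        p += 1
    return p


if __name__ == "__main__":
    p = min_precision_for(0.02)
    print(p, 1 << p, hll_relative_error(p))
    # Absolute size of a 2% margin on 1.34 million unique visitors.
    print(0.02 * 1_340_000)
```

Under that reading, a 2% target needs 2^12 = 4096 registers, and 2% of 1.34 million visitors is about 26,800 visitors either way.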

In-memory only, or hybrid?

It depends on your latency requirements. In-memory storage gives very fast access but carries the risk of data loss. You can use in-memory storage backed by persistent storage.

Open source solution, or home grown?

Do not reinvent the wheel. Use well-proven, battle-tested tools.

Has anybody built a lightweight set-only storage system with hyperloglog estimation on top of it?

As far as I know, Redis offers HyperLogLog as a data structure, so you can just use it. Use disk persistence to checkpoint the HyperLogLog data structure to disk frequently, to avoid losing it when the node goes down.
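
A minimal sketch of how this could look from Python, assuming the redis-py client (whose `pfadd` and `pfcount` methods map to the Redis `PFADD` and `PFCOUNT` commands). The `views:<item_id>` key scheme and the helper names are hypothetical, chosen only for the example.

```python
def views_key(item_id):
    # Hypothetical key naming scheme: one HyperLogLog per item.
    return "views:%s" % item_id


def record_view(client, item_id, user_id):
    """Register that user_id viewed item_id (PFADD)."""
    return client.pfadd(views_key(item_id), user_id)


def unique_viewers(client, item_id):
    """Approximate number of distinct viewers of one item (PFCOUNT)."""
    return client.pfcount(views_key(item_id))


def unique_viewers_across(client, item_ids):
    """Approximate number of distinct viewers over several items.

    PFCOUNT with several keys returns the cardinality of their union.
    """
    return client.pfcount(*[views_key(i) for i in item_ids])
```

With redis-py installed and a server running locally, `client` would be something like `redis.Redis(host="localhost", port=6379)`. For the checkpointing part, Redis can be configured with RDB snapshots (the `save` directive) or an append-only file (`appendonly yes`); which one fits depends on how much data you can afford to lose on a crash.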

Otherwise, you can also implement the HyperLogLog algorithm in Cassandra, leveraging the fact that Cassandra uses max(timestamp) as its conflict-resolution rule: trick the database by storing each HyperLogLog bucket value as the write timestamp, so that the largest value per bucket always wins.
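
The reason the trick works is that a HyperLogLog bucket only ever needs the maximum value written to it, and "highest timestamp wins" is exactly a max. The small simulation below mimics that resolution rule in plain Python; it does not talk to Cassandra, and `LwwCell` and `replay` are made-up names. In a real table the timestamp would presumably be supplied per write with CQL's `USING TIMESTAMP`.

```python
class LwwCell:
    """One cell resolved by last-write-wins: the write with the highest timestamp is kept."""

    def __init__(self):
        self.timestamp = None
        self.value = None

    def write(self, value, timestamp):
        # Keep the incoming write only if it is "newer" than what is stored.
        if self.timestamp is None or timestamp > self.timestamp:
            self.timestamp = timestamp
            self.value = value


def replay(writes, buckets):
    """Apply (bucket, rank) writes, using the rank itself as the write timestamp."""
    cells = [LwwCell() for _ in range(buckets)]
    for bucket, rank in writes:
        cells[bucket].write(value=rank, timestamp=rank)
    # An untouched bucket behaves like a register holding 0.
    return [cell.value or 0 for cell in cells]


if __name__ == "__main__":
    writes = [(0, 3), (1, 1), (0, 5), (0, 2), (1, 4)]
    print(replay(writes, buckets=2))
    print(replay(list(reversed(writes)), buckets=2))
```

Whatever order the writes arrive in, each bucket ends up holding its maximum rank, which is the state the HyperLogLog estimator needs.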

But it means that you need to write the implementation yourself, with the possibility of bugs.

doanduyhai
  • Actually, exact counts are useful and meaningful. The 100m+ DAU solution I built currently gives exact counts and is relatively durable. My question is what do other places with lots of traffic do. – nflacco Apr 08 '16 at 19:47