I've been spending some time looking into caching (mostly with Redis and Memcached) and am having a hard time figuring out where exactly to use caching when your data is constantly changing.
Take Twitter for example (I just read *Making Twitter 10000% faster*). How would you (or how do they) cache their data when a large percentage of the database records are constantly changing?
Say Twitter has these models: `User`, `Tweet`, `Follow`, `Favorite`.
Someone may post a Tweet that gets retweeted once in a day, and another that gets retweeted a thousand times in a day. For that 1000x retweet, since there are about `24 * 60 == 1440` minutes in a day, the Tweet is updated almost every minute (say it got 440 favorites as well). The same goes for following someone; Charlie Sheen even attracted 1 million Twitter followers in a single day. It doesn't seem worth it to cache in these cases, but maybe that's just because I haven't reached that level of traffic yet.
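For counts like these, the one pattern I can picture is keeping the hot counters in Redis itself rather than invalidating a cached record on every retweet. A rough sketch, assuming redis-py and made-up key names:

```python
import redis

r = redis.Redis()

def record_retweet(tweet_id: int) -> int:
    # INCR is atomic, so a counter absorbing a write per minute (or per
    # second) never forces the cached tweet body itself to be invalidated.
    return r.incr(f"tweet:{tweet_id}:retweet_count")
```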
Say also that the average Twitter user tweets/follows/favorites at least once a day. That means in the naive intro-Rails schema case, each row in the users table is updated at least once a day (`tweet_count`, etc.). This case makes sense for caching the user profile.
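For that profile case, I'm imagining plain read-through caching with a TTL near the update interval. A sketch assuming redis-py, where `load_user_from_db` and the key name are hypothetical:

```python
import json
import redis

r = redis.Redis(decode_responses=True)

def get_user_profile(user_id: int) -> dict:
    # Read-through: serve from Redis if present, otherwise load from the
    # database and cache for a day (roughly the expected update interval).
    key = f"user:{user_id}:profile"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    profile = load_user_from_db(user_id)  # hypothetical DB call
    r.setex(key, 24 * 60 * 60, json.dumps(profile))
    return profile
```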
But for the 1000x-retweet and 1M-follower examples above, what are the recommended practices when it comes to caching data?
Specifically (assuming Memcached or Redis, and a purely JSON API with no page/fragment caching):
- Do you cache individual Tweets/records?
- Or do you cache chunks of records via pagination (e.g. Redis lists of 20 each)? (See the sketch after this list.)
- Or do you cache both the records individually and in pages (viewing a single tweet vs. a JSON feed)?
- Or do you cache lists of Tweets for each different scenario: home timeline tweets, user tweets, user favorite tweets, etc.? Or all of the above?
- Or do you break the data into chunks from "most volatile (newest)" to "last few days" to "old", where "old" data is cached with a longer expiration time or in discrete paginated lists, and the newest records are not cached at all? (I.e., if the data is time-dependent like Tweets, do you treat older records differently once you know they won't change much?)
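To make the pagination option concrete, here's the shape I have in mind: a Redis list of tweet IDs per timeline, with each tweet cached once under its own key. A sketch assuming redis-py; the key names and page size are invented:

```python
import json
import redis

r = redis.Redis(decode_responses=True)
PAGE_SIZE = 20  # invented page size

def get_timeline_page(user_id: int, page: int) -> list[dict]:
    # The timeline is a Redis list of tweet IDs; each tweet is stored once
    # under its own key, so a popular tweet isn't duplicated per page/feed.
    start = page * PAGE_SIZE
    tweet_ids = r.lrange(f"timeline:{user_id}", start, start + PAGE_SIZE - 1)
    tweets = []
    for tweet_id in tweet_ids:
        raw = r.get(f"tweet:{tweet_id}")
        if raw is not None:  # on a miss you'd fall back to the database
            tweets.append(json.loads(raw))
    return tweets
```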
What I don't understand is the relationship between how often the data changes and whether you should cache it (and deal with the complexity of expiring the cache). It seems like Twitter could be caching the different user tweet feeds and the home timeline per user, but then invalidating the cache every time someone favorites/tweets/retweets would mean updating all of those cache entries (and possibly the cached lists of records); at some point it seems like invalidating the cache becomes counterproductive.
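The only alternative I can think of is updating the cached lists in place instead of invalidating them, i.e. fanning the new tweet's ID out onto each follower's cached timeline. A sketch with redis-py; `follower_ids`, the cap, and the key names are all made up:

```python
import json
import redis

r = redis.Redis(decode_responses=True)
TIMELINE_CAP = 800  # made-up bound on how many IDs a cached timeline keeps

def push_tweet(tweet: dict, follower_ids: list[int]) -> None:
    # Write-through fan-out: store the tweet once, then prepend its ID to
    # every follower's cached timeline and trim, instead of invalidating
    # and rebuilding each of those lists from the database.
    r.set(f"tweet:{tweet['id']}", json.dumps(tweet))
    pipe = r.pipeline()
    for follower_id in follower_ids:
        pipe.lpush(f"timeline:{follower_id}", tweet["id"])
        pipe.ltrim(f"timeline:{follower_id}", 0, TIMELINE_CAP - 1)
    pipe.execute()
```

But for the Charlie Sheen case that's still a million list writes per tweet, which is exactly the trade-off I'm asking about.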
What are the recommended strategies for caching data that changes this much?