
I don't have experience scaling out a SQL back end, but from what I've read so far, sharding for writes and caching for reads seem to be two of the most common practices. I'm trying to learn how eventual consistency could be minimized with the right caching strategy.

I'd like to use Azure SQL Database, Entity Framework & Elastic Scale middleware, and Redis for testing purposes.

Is there a way to commit a distributed transaction to both SQL Server and Redis?

If not, what's an effective way to ensure read freshness when a database change occurs?

I could write to SQL and update the cache in the same API, but the cache write might fail for whatever reason. I could implement retry logic, but assuming all attempts fail, I could either try to roll back the SQL transaction, or simply serve old cache data to clients and periodically rebuild the cache to catch up with the database. Of course, the latter would mean reads are inconsistent for a period of time. Evicting the data and reading from the SQL cluster is another option, but cross-shard queries might be very expensive, especially when they involve complex joins and you have hundreds, if not thousands, of databases on commodity hardware.
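To make the "write to SQL, then update cache, with retry and evict-on-failure" option concrete, here is a minimal sketch in Python. It uses plain in-memory dicts as stand-ins for the SQL shard and for Redis (the `FlakyCache` class and all names here are illustrative, not the actual Azure/Redis APIs); the point is the control flow, not the client library:

```python
class FlakyCache:
    """In-memory stand-in for Redis that fails a configurable
    number of set() calls, to exercise the retry path."""
    def __init__(self, failures=0):
        self.store = {}
        self.failures = failures

    def set(self, key, value):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("cache unavailable")
        self.store[key] = value

    def delete(self, key):
        self.store.pop(key, None)


db = {}  # stand-in for the SQL shard


def save(key, value, cache, retries=3):
    db[key] = value                  # 1. commit to SQL first
    for _ in range(retries):
        try:
            cache.set(key, value)    # 2. then refresh the cache
            return True
        except ConnectionError:
            pass                     # real code would back off here
    # 3. all retries failed: evict the stale entry so the next
    #    read falls through to SQL instead of serving old data
    cache.delete(key)
    return False
```

Note that in this sketch the final `delete` is assumed to succeed; in practice it can fail too, which is exactly why the SQL write and the cache update can never be made atomic without a distributed transaction.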

Mark13426
  • Asking for third-party frameworks is off-topic for StackOverflow. You should remove that part from your question so that it's not closed for that reason. – David Makogon Sep 30 '15 at 16:08

2 Answers


Your idea in the last part of your post (write to SQL and update the cache in the same API) seems reasonable. I'd change it slightly: give the cache a low but reasonable TTL, say 1 minute, and only update the cache on reads, when the DB is hit, so that the next query for that key hits the cache instead.

Pros:

  1. Past the 1min mark, you know your users are getting the right data.
  2. Fault-tolerant: If there's something wrong with cache and you can't update it for whatever reason, the next query just goes to DB instead and the client still gets the correct data.

Cons:

  1. You'll have slightly more reads against DB (though 1 extra read per key per minute shouldn't be a big deal).
  2. Clients could get data that is up to 1 minute old (at most; usually much less) after an update.
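The pattern described above is cache-aside with a TTL. A minimal sketch in Python, using an in-memory dict with expiry timestamps as a stand-in for Redis `SET` with `EX` (all names here are illustrative):

```python
import time


class TTLCache:
    """Minimal stand-in for Redis: entries expire ttl seconds after being set."""
    def __init__(self, ttl=60):
        self.ttl = ttl
        self.store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None or entry[1] < time.time():
            return None  # missing or expired
        return entry[0]

    def set(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)


db = {"user:1": "Alice"}  # stand-in for the SQL back end
cache = TTLCache(ttl=60)


def read(key):
    value = cache.get(key)
    if value is not None:
        return value              # cache hit: DB not touched
    value = db.get(key)           # miss (or expired): go to the DB...
    if value is not None:
        cache.set(key, value)     # ...and populate the cache on the way out
    return value
```

Because the cache is only populated on the read path, a failed cache write never leaves stale data behind; the worst case is an extra DB read once the TTL expires.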
Eli

One of the most important principles of sharding is to avoid "cross-shard" queries. If you can't, and you also need joins across shards, I don't think sharding will help you, and neither will a read cache. We use sharding heavily in a couple of projects, and it really helps with scalability, but as I said, only a very small portion of our service needs more than one shard. That's because we chose our shard key around the needs of the service. I think the most important decision in sharding is choosing the right shard key. If you can find a "perfect" shard key, you can use the database directly without any cache at all; at least, that is what we are doing now.
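The shard-key idea above amounts to a stable routing function: every row for a given key deterministically lands on one shard, so single-key queries never fan out. A minimal sketch in Python (the hash-modulo scheme and names are illustrative; Elastic Scale uses its own shard map, and modulo routing complicates resharding):

```python
import hashlib

SHARD_COUNT = 4  # e.g. four Azure SQL databases


def shard_for(user_id):
    """Route by a stable hash of the shard key (here, a user id) so
    all of one user's rows live on a single shard."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % SHARD_COUNT
```

A query scoped to one `user_id` then goes to exactly `shard_for(user_id)`; only queries that aggregate across users need to touch multiple shards, which is why picking a shard key that matches your access patterns matters so much.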

cool
  • Thanks. I agree that cross-shard queries can be problematic. It'd be interesting to see how a company like FB partitions user data across MySQL nodes. My first thought was that they shard based on UserId, and any user activity gets written to that user's shard, but when you think about aggregated results such as a user's wall and friend list, that kind of data might also need to be saved to the same shard. – Mark13426 Oct 01 '15 at 21:55