12

What would I use instead of SQL joins when my large, complex relational database no longer fits on a single machine? I've begun sharding the database across many machines, but as a result, I can no longer do joins efficiently.

Any tips?

David Xu
  • 5,555
  • 3
  • 28
  • 50
  • 1
    http://dba.stackexchange.com/ is the place for this sort of question. SO is for code and algorithm type questions. – O. Jones May 01 '14 at 12:14
  • 2
    Well I think it still has something to do with algorithms - as it isn't just about the database, I may need to rewrite much of the software infrastructure altogether. – David Xu May 01 '14 at 12:16
  • There are a lot of different flavors of doing this, but basically you want to precompute values. For example, if you were joining users and profiles tables, you might have a userprofiles table that gets created / updated whenever the underlying users and profiles tables change. In effect, userprofiles is closer to caching; you can even just write JSON strings into it. Once you have userprofiles, you can create several read-only instances and have achieved horizontal scaling. – timpone May 01 '14 at 12:18
  • Wouldn't that require me to keep much of the data in memory? Also, much of the data is randomly spread, causing me to go to Node A to do a fetch, Node X to do a fetch then node D for example, causing overall response speed to slow down. – David Xu May 01 '14 at 12:23
  • 1
    well, it depends what you mean in your original question by `, I can no longer do joins efficiently`. What is your bottleneck? Usually it's disk I/O and CPU. – timpone May 01 '14 at 12:40
  • Okay, well basically the problem is I used to do a lot of joins that ended up joining lots of tables. Now that the DB is sharded across many machines, I can't just do the join in one machine, I need to fetch the data from many machines, combine it and return it to the client. – David Xu May 01 '14 at 12:45
  • What DB are you using? Many support what you are talking about. That being said, in general, sharding is designed to have full copies of your schema spread across multiple machines and each machine supports a set of ids like these `users` are on this machine. So you shouldn't have to be joining across machines in that case. – timpone May 01 '14 at 13:49
  • Isn't that read replicas? Sharding is when the data is split between different machines? I'm using MySQL v5.6 on AWS RDS by the way. – David Xu May 01 '14 at 13:51
  • So in general you shard based upon a partition key like User or Location, which can either be self-contained, or you can do joins across hosts (MySQL federated allows this, but I have never used it; are you using MySQL Cluster or just community edition?). You have to structure your data so that this problem can't happen (see http://en.wikipedia.org/wiki/Shard_(database_architecture) ). To fix this, you usually precompute so that you aren't joining. In general, this is the problem with sharding and why I generally try memcache or JSON precomputing (or both) first. – timpone May 01 '14 at 16:14

2 Answers

15

There are many approaches to make this work; the general idea is to shard your data in such a way that related data is grouped together.

As a simple (trivial) example, if you have a Game database, you can shard the Player and PlayerGame data by the same key (playerId). If there are other related tables, you can add those too; think of it as a "shard tree" of related tables. All the data for a given Player is then guaranteed to be in the same shard, so you can perform joins within a shard, but you cannot do inner joins across shards.
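A minimal sketch of that idea (the table names and the modulo routing are just illustrative assumptions, not a specific product's API): because every table in the "shard tree" is keyed by playerId, the same routing function sends all of a player's rows to one shard, keeping joins local.

```python
# Hypothetical shard-tree routing: every table keyed by playerId
# (Player, PlayerGame, ...) uses the same routing function, so all
# rows for one player land on the same shard.
NUM_SHARDS = 4

def shard_for(player_id: int) -> int:
    """Map a playerId to a shard; hash/modulo is the simplest scheme."""
    return player_id % NUM_SHARDS

# Player 42's Player row and PlayerGame rows route identically,
# so a join between them never leaves the shard.
assert shard_for(42) == shard_for(42)
```

Range-based or directory-based routing works equally well; the only requirement is that every table in the tree uses the same key and the same function.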

The other common technique is to replicate Global tables to all shards. These are typically tables that are not updated often but are used in lots of joins.

With these two approaches you can:

  • Join within the Shard Tree (but not a cross-shard inner join, e.g., between 2 players)
  • Join from a sharded table to a Global table at any time
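To make the Global-table case concrete, here is a hypothetical sketch (the `game_titles` lookup and row shapes are invented for illustration): because the small, rarely-updated table is copied to every shard, the join happens entirely on the local node.

```python
# Hypothetical sketch: a small "global" lookup table is replicated to
# every shard, so sharded rows can join against it without any
# cross-shard network call.
game_titles = {101: "Chess", 102: "Go"}   # replicated copy on this shard

shard_player_games = [                     # PlayerGame rows on this shard
    {"playerId": 7, "gameId": 101},
    {"playerId": 7, "gameId": 102},
]

def local_join(rows, lookup):
    """Join sharded rows to the locally replicated global table."""
    return [{**row, "title": lookup[row["gameId"]]} for row in rows]

print(local_join(shard_player_games, game_titles))
```

The trade-off is that every update to a global table must be propagated to all shards, which is why the technique suits rarely-changing reference data.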

The other trick is distributed queries, where you may need to roll up results from many shards (e.g., a count of all Players).
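A distributed count can be sketched as a scatter-gather: fan the same query out to every shard, then combine the partial results. (The shard list below is a stand-in for real connections.)

```python
# Hypothetical scatter-gather sketch: each shard answers the same
# COUNT(*) query over its local Player rows; the coordinator sums
# the partial counts.
shards = [
    {"players": 120},   # partial count from shard 0
    {"players": 87},    # partial count from shard 1
    {"players": 203},   # partial count from shard 2
]

def count_all_players(shards):
    # Fan out, then roll up the partial results.
    return sum(s["players"] for s in shards)

print(count_all_players(shards))  # 410
```

Sums and counts combine trivially; averages or top-N queries need the coordinator to merge per-shard aggregates more carefully.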

Here is a white paper that describes a lot of this in more detail:

http://dbshards.com/dbshards/database-sharding-white-paper/

The key to this type of approach is to understand how you want to query the data. The denormalization suggested in the other answer can also be useful when you have to query the data from a different perspective. In that case you write the data in two (or more) formats and partition your shards according to each structure. Using the simple example above, say you need to query all the Players for a single GameInstance: you could make a separate "shard tree" with GameInstance as the parent and PlayerGame as the child, sharded by GameInstanceId. That query then becomes efficient too.
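A hypothetical sketch of maintaining the two shard trees (the in-memory dicts stand in for real shard nodes; names like `record_player_game` are invented): each write is applied twice, once keyed by playerId and once by gameInstanceId, so both query directions stay single-shard.

```python
# Hypothetical dual-write sketch: the same PlayerGame fact is stored in
# two shard trees, one partitioned by playerId and one by
# gameInstanceId, so "games for a player" and "players in a game" are
# both single-shard queries.
NUM_SHARDS = 4

def shard_for(key: int) -> int:
    return key % NUM_SHARDS

player_tree = {i: [] for i in range(NUM_SHARDS)}  # sharded by playerId
game_tree = {i: [] for i in range(NUM_SHARDS)}    # sharded by gameInstanceId

def record_player_game(player_id: int, game_instance_id: int) -> None:
    row = {"playerId": player_id, "gameInstanceId": game_instance_id}
    player_tree[shard_for(player_id)].append(row)        # write #1
    game_tree[shard_for(game_instance_id)].append(row)   # write #2

record_player_game(7, 12)
```

The cost is doubled write work and the need to keep the two copies consistent; the benefit is that neither read path ever fans out across shards.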

The goal is to keep as many operations single-shard as you can, since distributed operations are, somewhat ironically, the weak point of a distributed database cluster.

dbschwartz
  • 763
  • 3
  • 10
2

Depending on the data you are using, you could potentially denormalize it and spread it across different DB nodes. That would make your writes a bit more tricky, but would improve read performance.
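A minimal sketch of the idea, assuming hypothetical `users` and `profiles` tables (the names and JSON storage mirror the precomputation suggested in the question's comments, not any specific schema): the join result is materialized at write time so reads never join.

```python
# Hypothetical denormalization sketch: instead of joining users and
# profiles at read time, maintain a precomputed user_profiles store
# that is refreshed whenever either source row changes.
import json

users = {1: {"name": "alice"}}
profiles = {1: {"bio": "dba"}}
user_profiles = {}  # denormalized copy; reads hit this directly

def refresh(user_id: int) -> None:
    # Extra work on every write, but reads become a single lookup.
    merged = {**users[user_id], **profiles[user_id]}
    user_profiles[user_id] = json.dumps(merged)

refresh(1)
```

The `refresh` call must run on every update to either source table (e.g., from application code or a trigger), which is the "trickier writes" trade-off mentioned above.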

VHristov
  • 1,059
  • 2
  • 13
  • 25
  • 1
    Can you explain in a bit more detail? – David Xu May 01 '14 at 12:38
  • 1
    Basically, if you currently store two objects A and B in two separate tables and join them by common attributes p and q, you can create two new tables P and Q, where you already store the join of A and B on p and q respectively. Check out http://en.wikipedia.org/wiki/Denormalization for a possibly better explanation. – VHristov May 15 '14 at 14:55
  • @DavidXu This talk might help - especially her remarks at the end about Twitter's "precomputed feeds". [https://youtu.be/p3ytSdUQZzA?t=37m16s](https://youtu.be/p3ytSdUQZzA?t=37m16s). – Andy Carlson Jun 26 '17 at 01:09