What is data locality and why does it matter in Cloud Spanner vs. MySQL?

Question

Data locality and root tables aren't a concept in MySQL or PostgreSQL, so why do they matter in Cloud Spanner.

It appears that picking primary keys is very important in your schema design for Cloud Spanner.

score 2 · Answer 1 · answered Mar 03 '17 at 12:24

I think that data locality and root tables are as important in MySQL or PostreSQL as in Cloud Spanner, but they are called differently.

It's important to understand that Cloud Spanner is designed for very large databases, which are only possible to implement using a lot of traditional database servers. The usual approach to this problem, when you use MySQL or PostreSQL, is to create shards. You have to split your data into many smaller pieces and put every such piece on a different server. Of course such solution is quite complicated to implement and maintain.

Google Cloud Spanner does it for you automatically, but it's important to understand how you can manage this sharding. Every shard in Spanner is called a split. The rule is that in Spanner a split has to contain the row from the root table and all it's children. By deciding the root tables you define the potential split boundaries, which Spanner uses as he sees fit. You don't have to worry about the number of splits, data migration between the splits as they grow or shrink, etc.

If you make all tables root, then Spanner will have full flexibility in its sharding, but there is the price for that. It's difficult to maintain the consistency across the shards. You can imagine this as similar to performing a transaction on multiple MySQL or PostreSQL servers simultaneously. It's possible, but not easy. Spanner does it for you automatically as well, but there is a performance penalty.

You can read more about this model in Schema and Data Model documentation

Primary keys are also important in Spanner architecture. It allocates the root table rows to splits based on their primary keys. Consecutive keys can go to the same split. If you do a lot of transactions on such rows in the short period of time you probably want them to be distributed over more splits. You can read more in Best practices

Of course you don't see all these concepts in MySQL or PostgreSQL if you can fit your data on the single server ;)

What is data locality and why does it matter in Cloud Spanner vs. MySQL?

1 Answers1