
I designed a table schema for my Spanner database that uses interleaved tables to store associated records together, with the goal of optimizing read times. My schema looks something like this:

CREATE TABLE all_ids (
  userid INT64 NOT NULL,
) PRIMARY KEY(userid);

CREATE TABLE connections_in (
  userid INT64 NOT NULL,
  child_userid INT64 NOT NULL,
  connect_time TIMESTAMP NOT NULL,
) PRIMARY KEY(userid, child_userid),
  INTERLEAVE IN PARENT all_ids ON DELETE CASCADE;

CREATE TABLE connections_out (
  userid INT64 NOT NULL,
  child_userid INT64 NOT NULL,
  connect_time TIMESTAMP NOT NULL,
) PRIMARY KEY(userid, child_userid),
  INTERLEAVE IN PARENT all_ids ON DELETE CASCADE;

The connections_in and connections_out tables store graph connections between nodes in a user database. Each user may have at most several hundred connections.

I also want to store each user's metadata history over time, and I'm not sure whether it makes sense to interleave that data in the parent table all_ids or not. Each user accumulates several thousand metadata rows over time:

CREATE TABLE user_metadata (
  userid INT64 NOT NULL,
  snapshot_time TIMESTAMP NOT NULL,
  username STRING(1024) NOT NULL,
  user_description STRING(1024) NOT NULL,
) PRIMARY KEY(userid, snapshot_time DESC),
  INTERLEAVE IN PARENT all_ids ON DELETE CASCADE;

I know that Spanner orders rows lexicographically by primary key. Does that mean a row from the user_metadata table could be physically collocated next to a row from connections_in or connections_out, since the first part of all of these primary keys is taken from the table all_ids? And if so, would reads from connections_in/connections_out by userid be slower than if I created a separate, non-interleaved table for user_metadata in the same database, like this?

CREATE TABLE user_metadata (
  userid INT64 NOT NULL,
  snapshot_time TIMESTAMP NOT NULL,
  username STRING(1024) NOT NULL,
  user_description STRING(1024) NOT NULL,
) PRIMARY KEY(userid, snapshot_time DESC);
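
For concreteness, the per-user read I'm worried about slowing down is just a lookup like this (123 is a placeholder userid):

-- example read: all inbound connections for one user (placeholder id)
SELECT child_userid, connect_time
FROM connections_in
WHERE userid = 123;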

Any help is greatly appreciated!

  • Can you provide some examples of how you are expecting these tables to be queried? – Derek Downey Oct 17 '22 at 20:00
  • @DerekDowney I wanted to do something like: SELECT * FROM user_metadata WHERE userid = 123 AND snapshot_time > '2023-01-01' LIMIT 100. – randomdatascientist Mar 08 '23 at 10:48
  • I actually ended up implementing this in production on a Spanner instance with 500 processing units and loaded 10 interleaved tables, 5 of which have between 78 million and 1.1 billion rows and up to 26 columns. Read latency (throughput?) is phenomenal: I can easily retrieve a few hundred records for a given userid (or many userids across multiple concurrent workers) in ~50 ms. The same is true for writes, although they're not quite as fast. – randomdatascientist Mar 08 '23 at 11:10

1 Answer


Spanner uses interleaving to collocate data within the same split, and therefore to assign related data to the same compute node for processing and joins. It's not really a row-by-row mechanism; inserting a user_metadata record won't necessarily cause the all_ids record to move a disk sector further away from the connections_out record, or anything like that.
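
As a sketch of what that collocation buys you (123 is just a placeholder userid), a single-user join across the parent and an interleaved child can typically be served entirely by the server that owns that user's split:

-- minimal sketch: parent row and interleaved child rows for userid 123
-- live in the same split, so this join stays local to one server
SELECT a.userid, c.child_userid, c.connect_time
FROM all_ids AS a
JOIN connections_in AS c ON c.userid = a.userid
WHERE a.userid = 123;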

At the level of splits, there can be a performance impact, but the answer is "it depends." There are two scenarios:

  1. userid is high-cardinality

Let's say you have 100 Spanner nodes and 1,000,000 distinct userids (and usage per user is at least roughly uniform). In this case, Spanner generally won't need to look past the first key in order to distribute the data reasonably uniformly around the cluster, and will tend to keep things clustered by just that first key (userid) for performance and simplicity. So all of the data for each user will be stored together, even if you have a bunch of interleaved tables.

  2. userid is low-cardinality

Let's say you have 100 Spanner nodes and 8 users, but each of those users has lots and lots of connections_in, connections_out, and/or user_metadata records. In this case, Spanner will want to split up each user's data across multiple Spanner nodes so that it can take full advantage of the 100-node cluster. Joining all of the data for a single user back together then requires data from several Spanner nodes -- that's a distributed JOIN, which is (generally speaking) more expensive than having all of the data local to a single node.
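
To make the contrast concrete, here's a sketch (again with a placeholder userid) of a single-user query that touches two interleaved tables. In the low-cardinality case, the rows it needs may sit in several splits on different servers, so Spanner has to fetch and combine them rather than answering locally:

-- sketch: latest username for user 123 alongside all outbound connections;
-- with user 123's rows spread across splits, this becomes a distributed join
SELECT m.username, c.child_userid, c.connect_time
FROM user_metadata AS m
JOIN connections_out AS c ON c.userid = m.userid
WHERE m.userid = 123
  AND m.snapshot_time = (
    -- latest snapshot only, so each outbound connection appears once
    SELECT MAX(snapshot_time) FROM user_metadata WHERE userid = 123
  );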

If you have a large number of users but one of them has dramatically more requests/sec spread across its child records than the others, Spanner may decide to split up the records associated with just that one extra-large userid. This is not a common case, but if you have serious data and load skew it can happen, as a way to avoid overloading the single Spanner node handling that user's workload and causing an even larger latency spike.

  • Hi Adam, thanks for the detailed explanation. My use case is closer to 2 than to 1, although there are some userids that have many more connections in/out than the average userid. I actually ended up implementing this solution in production on an instance with 500 processing units and multiple interleaved tables containing hundreds of millions of rows each, and it works like a charm. I assume you're a Google engineer working on Spanner, given your detailed explanation. If that assumption is correct, thanks for making such an amazing product! I don't know how you guys do it, but it is blazing fast. – randomdatascientist Mar 08 '23 at 21:49
  • Sorry I misspoke - my use case is closer to 1, not 2 with ~10k userids that have many more connections than the average userid (10x-1000x). Would this cause problems? There are ~80 million userids in total. – randomdatascientist Mar 09 '23 at 04:36
  • Thanks! Yeah, I'm on the Spanner team here. Glad Spanner has been working well for you! Generally, if you have data skew, you'll run into issues if your biggest user wants to be bigger than a whole node can handle. With 80M users and the biggest user being as big as 1,000 small users, that means you might have trouble scaling past (very roughly) 80M/1000 = 80,000 full Spanner nodes, not counting read replicas. Which Spanner can do. But if you're able to serve 80M user records from a single 500-processing-unit Spanner node today, I think you've got plenty of room to scale :-) – Adam Seering Mar 09 '23 at 23:53
  • If you did need to scale past 80,000 Spanner nodes or so without getting any more users, what might happen is that your largest users (which I imagine would have a very large number of connections at that point) might have their data handled by multiple effectively-dedicated nodes. Whereas regular users would have data collocated / not split across nodes. This would still work fine; you would just see a modest latency increase for those very large users due to needing to fetch and combine data from 2+ nodes instead of just one. (Complex query plans might be more affected than simple ones.) – Adam Seering Mar 09 '23 at 23:56