Optimizing hierarchical data sets for reads of whole hierarchies

Question

I am migrating an app from Oracle to Google Spanner. One of the cases we came across are relationships between rows in the same table.

These relationships have a tree-like structure, always having one parent and one root of the hierarchy altogether. Bottom up and top to bottom query patterns are possible.

There will be cases where we'd like to have efficient access to the whole record-tree. This data access pattern is latency critical.

The application previously used Oracle and their hierarchical queries (connect by) and was highly optimized for that vendor.

The number of rows in one tree-fetch would range between 1-2000. Table will have millions of sych rows.

Rows of that table do have interleaved child table rows within.

Would it make much sense to optimize the table for better data locality by denormalizing the model and redundantly adding the root record's id as the first column of the primary key of that table for faster top-down queries?

It would go like this:
root_id | own_id | parent_id 1 | 1 | 1 1 | 2 | 1 1 | 3 | 2 4 | 4 | 4 4 | 5 | 4 4 | 5 | 4

Ie. we are considering to make PK consist of (root_id, own_id) here. (values are superficial, we can spread them out in real scenario).

What is the chance for such rows, containing same first element of the PK to go to the same split? Would there be actual benefit to do so?

score 1 · Answer 1 · answered Mar 06 '18 at 01:35

1

Cloud Spanner supports parent-child table relationships to declare a data locality relationship between two logically independent tables, and physically co-locate their rows for efficient retrieval. Please see this link for more information: https://cloud.google.com/spanner/docs/schema-and-data-model#parent-child_table_relationships

For example, assuming we have a table 'Root' with primary key 'root_id', we can declare the table 'Own' to be a child of the 'Root' table. The primary key of the parent table becomes a prefix to the primary key of the child table. So table 'Own' could have a primary key of (root_id, own_id). All rows of table 'Own' having the same 'root_id' would be located in the same split.

Splits do have a max size limit. As a rule of thumb, the size of every set of related rows in a hierarchy of parent-child tables should be less than a few GiB.

answered Mar 06 '18 at 01:35

Rose Liu

251
1
5

Thanks, yes, I know that case, but I don't think it is responsive to the one I described: a multi level, varying depth hierarchy of referencing rows (having same structure, hence in single table) that we want to query top to bottom in a single query. Is there any means to help data locality for such queries? – wojciechka Mar 07 '18 at 19:51
The design of the primary key in a child table can be used to control locality within the child table. It seems you have the right idea of adding the root record's id as the first column of the primary key of that table, but you would need to interleave this (child) table in the parent table whose primary key is root_id. This would cause all rows of the child table having the same root_id to be in the same split. To prevent the size of the split from becoming too large, a shard_id can be added to the primary key of the parent. – Rose Liu Mar 10 '18 at 01:23

Optimizing hierarchical data sets for reads of whole hierarchies

1 Answers1