2

I have a big query table which is day partitioned. There are some 10M records present in this table. Is it possible to update the table to add clustering columns? If yes, will the sorting be done on existing data? I looked up the docs but couldn't get a clear picture.

praneethh
  • 263
  • 4
  • 16

2 Answers2

7

EDIT: My previous response about clustering configuration being immutable was incorrect, it can be modified after creation. Thanks fremzy@ for pointing this out.

To do this with the bq CLI:

bq update --clustering_fields=field1,field2,field3 mydataset.mytable

My previous instructions about creating a new table via a CTAS below:

The safest way to do this is probably to leverage a CREATE TABLE ... AS SELECT ... statement to define the new table and select everything from the partitioned table into it.

There's a relevant example in the DDL documentation which covers the syntax for defining partitioning and clustering: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#example_1_2

shollyman
  • 4,216
  • 19
  • 17
  • Caveat: with the `update` command, data will only be clustered from that point forward unless recreating the table. https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec – Brian Bien Mar 25 '23 at 17:32
1

I think the above answer is not totally true, there isn't enough documentation around this. But you can follow it here Make existing bigquery table clustered and https://github.com/googleapis/google-cloud-php/issues/3655

Summary, is 1) you can add clustering later if your table is partitioned 2) you might also be able to add clustering on non-partitioned table with the recent updates as listed on the github issue

Also note that, any updates to clustering are only effective for the data created from that point of time. So your pre-existing rows will still be not clustered ref- cloud documentation

Fremzy
  • 290
  • 2
  • 12