12

I have a quite huge existing partitioned table in bigquery. I want to make the table clustered, at least for the new partition.

From the documentation: https://cloud.google.com/bigquery/docs/creating-clustered-tables, it is said that we are able to Creating a clustered table when you load data and I have tried to load a new partition using clustering fields: job_config.clustering_fields = ["event_type"].

The load finished successfully, however it seems that the new partition is not clustered (I am not really sure how to check whether it is clustered or not, but when I query to that particular partition it would always scan all rows).

Is there a good way to make clustering field for an existing partitioned table?

Any comment, suggestion, or answer is well appreciated.

Thanks a lot, Yosua

Yosua Michael
  • 165
  • 1
  • 2
  • 7

3 Answers3

19

BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table.

You can change the clustering specification in the following ways:

Call the tables.update or tables.patch API method.

Call the bq command-line tool's bq update command with the --clustering_fields flag.

Reference

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec

SANN3
  • 9,459
  • 6
  • 61
  • 97
12

What I missed from the above answers was a real example, so here it goes:

bq update --clustering_fields=tool,qualifier,user_id my_dataset.my_table

Where tool, qualifier and user_id are the three columns I want the table to be clustered by (in that order) and the table is my_dataset.my_table.

Leonardo Campos
  • 158
  • 1
  • 4
  • +1 thank you. I tried adding the project-id and it did not like it... I needed to 'gcloud config set project PROJECT_ID' and then run the command from the correct project. – Nij Jun 23 '23 at 06:56
  • 1
    The command to use another project is: bq update --project_id=the_project_to_run_from --clustering_fields=tool,qualifier,user_id my_project:my_dataset.my_table In the resource part, a colon separates the project from the dataset and table. – Leonardo Campos Jun 27 '23 at 12:26
9

This answer is no longer valid / correct

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec


You can only specify clustering columns when a table is created
So, obviously you cannot expect existing non-clustered table and especially just new partitions to become clustered

The "workaround" is to create new table to be properly partitioned / clustered and load data into it from Google Cloud Storage (GCS). You can export data from original table into GCS first for this so whole process will be free of charge

dogmatic69
  • 7,574
  • 4
  • 31
  • 49
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Thanks! Yeah, I think I can use the work around for now :) – Yosua Michael Aug 30 '18 at 09:41
  • Hi, I have two table one is non-cluster and cluster table with same fields. How to copy the data from non-cluster table to cluster table if both table exist. ? – Rumesh Krishnan Jan 14 '19 at 10:37
  • 1
    I think this answer is not completely correct. According to the official docs : `By calling the tables.update or tables.patch methods, table clustering specifications can be changed or removed. The set of clustered columns in a clustered table can also be changed to a different set of columns. When a table is converted from non-clustered to clustered or the clustered column set is changed, automatic re-clustering only works from that time onward.` What I understood is that you can specify a clustering column also later. It's just that data added previously won't be clustered – Seba92 May 14 '20 at 13:37
  • the answer was given in August 2018 at which time things were different :o) – Mikhail Berlyant May 14 '20 at 13:39
  • Is this working by now? I'm trying to update an existing table with clustering information, but the API responds with a `"Cannot add, update, or remove clustering without partitioning field."` error (that doesn't really make sense to me). – Hirnhamster Jan 08 '21 at 09:06
  • I have the strong suspicion that the docs are lying... after adding a partitioning column the error becomes `"Cannot change partitioning/clustering spec for a partitioned/clustered table."` - which imho contradicts the statement in the docs. – Hirnhamster Jan 08 '21 at 09:17
  • After some more fiddling, it works as expected but only on tables with field partitioning. I believe this might be a bug, because one can CREATE non-partitioned tables with clustering - so it should also be possible to CHANGE the clustering on those tables. I created an issue at https://github.com/googleapis/google-cloud-php/issues/3655 – Hirnhamster Jan 08 '21 at 10:18
  • As of Aug 15, 2022, it appears to allow clustering to be added to tables that only have `_PARTITIONTIME` partitioning (not field partitioning) too. I was able to add clustering to a table like this, that was created about 5 years ago, in my project. – Matt Welke Aug 15 '22 at 19:41
  • @MikhailBerlyant - The workaround is the only option even now when we add a clustering field for an existing table. The old data can only be re-clustered if we re-create the table and load the table. Is there any other alternative? I couldn't find any document to do this even now. – User-8017771 Mar 01 '23 at 10:14