
I have a partitioned table in BigQuery that stores user events, with the following schema:

user_id - String, Required
event_time - DateTime, Required
country - String, Nullable
_PARTITIONTIME - the date the event occurred

The above table holds two years' worth of data (i.e., 730 partitions).

For some reason I would like to change the data type from DateTime to Timestamp. I found similar questions here and here; those work fine for a non-partitioned table. But in my case I have a partitioned table, so I need to retain the _PARTITIONTIME value as-is.

After some research I found that running the following query with the destination table set to table_name$20180126 achieves my goal for a single partition:

    SELECT
      user_id,
      CAST(event_time AS TIMESTAMP) AS event_time,
      country
    FROM [project-id:data_set.table_name]
    WHERE _PARTITIONTIME >= "2018-01-26 00:00:00"
      AND _PARTITIONTIME < "2018-01-27 00:00:00"

But the problem is that I have 730 partitions for this table (and similarly 10 more tables like it), so running the above query one by one in the web console, through the API, or through the bq command line will take ages. Is there any better way to achieve this with minimal effort?
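For reference, a minimal scripted sketch of this per-partition rewrite (the approach that emerges in the comments below), using the bq CLI with the legacy-SQL dialect from the question. The date range, dataset, and table names are placeholders, the script assumes GNU date, and the flags should be verified against your bq version:

    #!/bin/bash
    # Rewrite each daily partition in place, casting event_time from
    # DATETIME to TIMESTAMP. Each query is billed only for its own
    # partition, so the whole loop amounts to one full table scan.
    # (Queries could also be issued in parallel, per the comments.)
    start="2016-01-27"   # oldest partition (placeholder)
    end="2018-01-26"     # newest partition (placeholder)

    d="$start"
    while [[ ! "$d" > "$end" ]]; do       # ISO dates compare lexicographically
      partition=$(date -d "$d" +%Y%m%d)   # e.g. 20180126
      bq query \
        --use_legacy_sql=true \
        --allow_large_results \
        --replace \
        --destination_table "data_set.table_name\$${partition}" \
        "SELECT user_id, CAST(event_time AS TIMESTAMP) AS event_time, country
         FROM [project-id:data_set.table_name]
         WHERE _PARTITIONTIME = TIMESTAMP('$d')"  # _PARTITIONTIME is midnight UTC
      d=$(date -d "$d + 1 day" +%F)
    done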

  • Although DDL has been recently launched, it only supports Create and Delete syntax; no Update yet. Scripting this, as you mentioned, is the way to go. It should not take forever, as you can issue multiple queries in parallel. – Pentium10 Jan 31 '18 at 20:03
  • @Pentium10 - if scripted, does that mean 730 full table scans ($$$)? – Graham Polley Feb 01 '18 at 07:22
  • @GrahamPolley Do you suggest any other approach? – Jaya Ananthram Feb 01 '18 at 08:24
  • I'm thinking you could use Dataflow or Mikhail's approach over here (slightly tweaked, of course): https://stackoverflow.com/questions/38993877/migrating-from-non-partitioned-to-partitioned-tables – Graham Polley Feb 01 '18 at 09:36
  • @JayaAnanthram you say it's already partitioned, so 730 queries, each reading its respective partition, result in 1 full table scan in total. – Pentium10 Feb 01 '18 at 10:41
  • @Pentium10, yes, my table is partitioned, so I will be charged for one full table scan (since each query is bounded to its corresponding partition). But my main concern is that in production some downstream applications will be using BigQuery, so I want to minimize the schema-update time and downtime for a table. (Something like `doing it for each partition` vs. `doing it for the complete table`; AFAIK each partition resides in a separate file, so I am trying to do it across files in a single command.) – Jaya Ananthram Feb 01 '18 at 14:22
  • 1
    Could you patch your table with a new column. Run an update statement to populate the new column. Eventually rewrite the table by removing the old column(either keep it). https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/patch – Pentium10 Feb 01 '18 at 21:31
  • This is a good idea @Pentium10. – Graham Polley Feb 02 '18 at 02:29
  • @Pentium10 I think I didn't get you. For DML updates, the daily limit is 10,000, so I would need to increase that (fine for now). What do you mean by rewrite the table? Rewrite to a new table? If yes, then there is no point in doing an UPDATE, right? I can simply cast the field while rewriting to a new table, right? (Though I need to do it for each partition.) Am I correct, or am I missing anything? – Jaya Ananthram Feb 02 '18 at 05:56
  • Sorry, I need to correct something: DML for partitioned tables is not yet supported; I remembered wrong. You need to rewrite the table with scripting. You can test that on a copy of your dataset. The 10k limit, btw, is the number of queries, not the number of rows; a single DML query can update the whole table. – Pentium10 Feb 02 '18 at 06:25
  • @Pentium10 I thought an UPDATE would have to run for every partition; that's the reason I was wondering about the quota. So finally, I am going to rewrite the table with scripting for every partition (quite hectic stuff). Just wondering whether it is worth raising an enhancement request: `Write to new destination table with partition column`. Thanks for your help. – Jaya Ananthram Feb 02 '18 at 08:16
  • @JayaAnanthram this has been released today; you might be interested: https://cloud.google.com/bigquery/docs/creating-column-partitions – Pentium10 Feb 09 '18 at 07:35

1 Answer


There has been a lot of discussion in the comments, but I want to highlight two things.

  1. As of today, you can create partitioned tables using an existing TIMESTAMP or DATE column as the partitioning column.

  2. There is also a published guide on Manually Changing Table Schemas.

You may want to read both; the first one in particular may come in handy, and you might choose to reload the data into a column-partitioned table for its many benefits.
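To illustrate the first point, here is a minimal sketch using the bq CLI: it creates a new table partitioned on the event_time column and reloads the data with a cast in one standard-SQL query. The new table name is a placeholder, the flags should be checked against the linked documentation, and note that the result is partitioned by event_time rather than by the original ingestion-time _PARTITIONTIME:

    # Create a new table partitioned on the event_time column
    # (column-based partitioning, per the linked documentation).
    bq mk --table \
      --time_partitioning_field event_time \
      data_set.table_name_v2 \
      user_id:STRING,event_time:TIMESTAMP,country:STRING

    # Reload all data in one standard-SQL query, casting event_time
    # from DATETIME to TIMESTAMP on the way in.
    bq query \
      --use_legacy_sql=false \
      --destination_table data_set.table_name_v2 \
      --append_table \
      'SELECT user_id, CAST(event_time AS TIMESTAMP) AS event_time, country
       FROM `project-id.data_set.table_name`'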

Pentium10
  • Thanks for the update, really interesting. I hope it will help me, but I am afraid of the beta warning in the documentation: `This is a beta release of column-based, time-partitioned tables. This feature might be changed in backward-incompatible ways and is not subject to any SLA or deprecation policy.` (since I want to do this in my PROD environment). Anyway, I will have a look at it. – Jaya Ananthram Feb 12 '18 at 10:16
  • Don't be afraid. Google has tens of projects still in beta. – Pentium10 Feb 12 '18 at 10:19