
I have a bucket that contains 28 files and I would like to load them into separate tables (one table per file in the bucket).

Each file in the bucket looks like this:

file_1_2020.csv
file_2_2020.csv
...
file_28_2020.csv

I would like each table to be named with the first 6 characters of the filename + _table (they are all CSVs with headers). For example, file_1_table.

I read in a blog that I could achieve this by creating a federated query that lists all the files in the bucket. Then I could loop through the names and create the tables I need. The thing is that all the GCP documentation I have read covers federated queries against Cloud SQL, so I don't really know how I can make BigQuery read my bucket.

How can I achieve this? It doesn't have to be done with a federated query (although I would like to know how, because it's always good to learn new things). Also, I would like the tables to be partitioned and clustered by datetime, which is the first column of each file (the schema of each file is datetime:TIMESTAMP,col1:FLOAT,col2:FLOAT).
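For reference, this is roughly what I'm after for a single file (just illustrative, with a made-up dataset name mydataset; FLOAT in the schema string corresponds to FLOAT64 in DDL, and I haven't tested this exact statement):

CREATE TABLE mydataset.file_1_table (
  datetime TIMESTAMP,
  col1 FLOAT64,
  col2 FLOAT64
)
PARTITION BY DATE(datetime)
CLUSTER BY datetime;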

2 Answers


You can use EXECUTE IMMEDIATE to achieve this:

Step-1: Create an external table to get the list of files in the storage bucket

CREATE EXTERNAL TABLE mydataset.mytable
OPTIONS (
  format = 'CSV',
  uris = ['gs://bucket_name/*.csv']
);
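To sanity-check the external table (and preview the file list the next step loops over), you can query the _FILE_NAME pseudo-column; note that even this query reads the underlying files, so it is billed for the bytes scanned:

SELECT DISTINCT _FILE_NAME AS file_uri
FROM mydataset.mytable;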

Step-2: Use EXECUTE IMMEDIATE to create tables dynamically

-- Collect the distinct source file URIs exposed by the external table,
-- then create one external table per file, named after the first
-- 6 characters of the file name plus '_table'.
DECLARE FILE_LIST ARRAY<STRING>;
DECLARE TABLE_NAME STRING;
DECLARE I INT64 DEFAULT 1;
DECLARE CNT INT64 DEFAULT 0;

-- This query reads the external table, so it is billed for the bytes scanned.
SET FILE_LIST = ARRAY(SELECT DISTINCT _FILE_NAME FROM mydataset.mytable);
SET CNT = ARRAY_LENGTH(FILE_LIST);

WHILE I <= CNT DO
  -- Strip the path prefix and the '.csv' suffix, keep the first 6 characters,
  -- and append '_table'.
  SET TABLE_NAME = CONCAT(SUBSTR(REPLACE(SUBSTR(FILE_LIST[ORDINAL(I)], INSTR(FILE_LIST[ORDINAL(I)], '/', -1) + 1), '.csv', ''), 1, 6), '_table');
  -- Backticks around the table path guard against generated names that need quoting.
  EXECUTE IMMEDIATE "CREATE EXTERNAL TABLE `mydataset." || TABLE_NAME || "` OPTIONS (format = 'CSV', uris = ['" || FILE_LIST[ORDINAL(I)] || "'])";
  SET I = I + 1;
END WHILE;
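Note that external tables can't be partitioned or clustered. If you need the partitioned-and-clustered tables the question asks for, a variant of the loop could create native tables instead. A sketch (untested), reusing the loop variables above and assuming the external table exposes the header column names datetime, col1, and col2; each statement re-scans the external table, so it is billed per file:

  EXECUTE IMMEDIATE "CREATE TABLE `mydataset." || TABLE_NAME || "`"
    || " PARTITION BY DATE(datetime) CLUSTER BY datetime"
    || " AS SELECT datetime, col1, col2 FROM mydataset.mytable"
    || " WHERE _FILE_NAME = '" || FILE_LIST[ORDINAL(I)] || "'";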
Soumendra Mishra
  • Hi, thank you for your answer. The first step worked fine :). On the second one, I got an error that says: `Invalid EXECUTE IMMEDIATE sql string `CREATE EXTERNAL TABLE mydataset.file_1_2020_table OPTIONS (format = 'CSV', uris = ['gs://bucket_name/file_1_2020.csv'])`, Syntax error: Expected keyword OPTIONS at [13:21]`. Plus, it's getting the full name of the CSV file, and I just need the first 6 characters. The real name of the file could be `file_1_with_a_very_long_name_2020.csv`. Also, the execution of the step 2 query got billed for 22.9 GB :(. – Pedro Pablo Severin Honorato Nov 27 '20 at 16:40
  • Something I should have pointed out is that the sum of the sizes of all the files is around 22.9 GB – Pedro Pablo Severin Honorato Nov 27 '20 at 16:40
  • Thank you. But why did that error happen? I haven't executed the query again, since I want to avoid getting billed just to hit the same error. How can I fix it? – Pedro Pablo Severin Honorato Nov 27 '20 at 17:47

You could leverage Cloud Workflows, a new Google Cloud product, for this use case.

In a nutshell, Workflows allows you to connect services together: anything that has a public API, whether a Google Cloud API, a SaaS API, or a private API.

Essentially you need to connect Cloud Storage with BigQuery to load the files.


You will need to define the following steps in the workflow:

  • use the Cloud Storage API to fetch the files from your bucket/folder
  • extract segments from the filename, e.g. 12 and 2020, into variables for later use
  • use the BigQuery load API, passing the sourceURIs obtained in the first step and combining the segments from step 2 into your destination table name (see the sketch below)
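A rough sketch of such a workflow definition (untested; bucket_name, mydataset, and my-project are placeholders, skipLeadingRows/autodetect reflect the CSVs-with-headers detail, and the table naming takes the first 6 characters of the filename as the question asks):

main:
  steps:
    - list_files:
        # List every object in the bucket via the Cloud Storage connector.
        call: googleapis.storage.v1.objects.list
        args:
          bucket: bucket_name
        result: files
    - load_each_file:
        # Submit one BigQuery load job per file.
        for:
          value: file
          in: ${files.items}
          steps:
            - load_one:
                call: googleapis.bigquery.v2.jobs.insert
                args:
                  projectId: my-project
                  body:
                    configuration:
                      load:
                        sourceUris:
                          - ${"gs://bucket_name/" + file.name}
                        destinationTable:
                          projectId: my-project
                          datasetId: mydataset
                          # First 6 characters of the filename + "_table".
                          tableId: ${text.substring(file.name, 0, 6) + "_table"}
                        sourceFormat: CSV
                        skipLeadingRows: 1
                        autodetect: true
                        timePartitioning:
                          type: DAY
                          field: datetime
                        clustering:
                          fields: ["datetime"]

Because the load job configuration accepts timePartitioning and clustering settings, this approach would also cover the partitioning/clustering requirement from the question without a separate DDL step.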

I have a full blog post about this: Using Cloud Workflows to load Cloud Storage files into BigQuery

Pentium10