I appreciate this may be quite trivial, but I am struggling to find an elegant solution. I do have access to modify the job configuration, through Python in this case.

I am invoking a load job through the BigQuery Python API, using schema autodetect to parse a CSV file into a BigQuery table. Is it possible to ignore certain columns as part of the load job?
For example, I am creating a LoadJob sourced from the following CSV file (which I have formatted to make it easier on the eyes):
First_Name, Age, Gender
John, 26, Male
Is it possible to invoke a LoadJob using the Python BigQuery API that will produce the following table?

----------------
| Age | Gender |
----------------
| 26  | Male   |
----------------
My current solution, shown below, creates an external table and uses SQL to filter the columns, saving the result as a new table. Surely there is an easier way to do this via the bigquery.job.LoadJobConfig class?
from google.cloud import bigquery

bq_client = bigquery.Client()

# Configure the external table
uri = 'gs://my-bucket/myFile.csv'
table = bigquery.Table('my-project.my-dataset.my-temporary-table')
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [uri]
external_config.options.skip_leading_rows = 1
external_config.autodetect = True
table.external_data_configuration = external_config

# Create the external table
external_table = bq_client.create_table(table)

# Select only the wanted columns and save the result as a new table
filtered_table = bigquery.Table('my-project.my-dataset.filtered-table')
job_config = bigquery.QueryJobConfig(destination=filtered_table)
sql = f"""
SELECT Age, Gender
FROM `{external_table.project}.{external_table.dataset_id}.{external_table.table_id}`
"""

# Start the query, passing in the extra configuration.
query_job = bq_client.query(sql, job_config=job_config)  # Make an API request.
query_job.result()  # Wait for the query job to complete.
bq_client.delete_table(table=external_table, timeout=30)  # Remove the external table
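For completeness, the only other workaround I can think of is pre-filtering the CSV before uploading it to Cloud Storage, which I would rather avoid. A minimal sketch using only the standard csv module (drop_columns is my own helper, not part of the BigQuery API; the column names match my example file):

```python
import csv
import io

def drop_columns(csv_text, columns_to_drop):
    """Return the CSV text with the named columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    kept = [name for name in reader.fieldnames if name not in columns_to_drop]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # dropped columns are ignored via extrasaction
    return out.getvalue()

filtered = drop_columns("First_Name, Age, Gender\nJohn, 26, Male\n",
                        {"First_Name"})
print(filtered)
# Age,Gender
# 26,Male
```

This works, but it means downloading, rewriting, and re-uploading every file, which defeats the point of a single load job.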