
I am trying to create an external table in BigQuery for a Parquet file that is present in a GCS bucket, but when I run the code below in Airflow, I get the following error:

ERROR:

[2023-07-04, 10:03:44 UTC] {taskinstance.py:1770} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/operators/bigquery.py", line 1712, in execute
    table = bq_hook.create_empty_table(
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 468, in inner_wrapper
    return func(self, *args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 413, in create_empty_table
    return self.get_client(project_id=project_id, location=location).create_table(
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 779, in create_table
    api_response = self._call_api(
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 813, in _call_api
    return call()
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 349, in retry_wrapped_func
    return retry_target(
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 191, in retry_target
    return target()
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/idmp-mii-dev-ddb5/datasets/tivo_site_activity_0/tables?prettyPrint=false: CsvOptions can only be specified if storage format is CSV.

DAG CODE:

create_imp_external_table = BigQueryCreateExternalTableOperator(
    task_id="create_imp_external_table",
    bucket='my-bucket',
    source_objects=["/data/userdata1.parquet"],  # pass a list
    destination_project_dataset_table="my-project.my_dataset.parquet_table",
    source_format='PARQUET',  # use source_format instead of file_format
)

Composer version: 2.3.2

Airflow version: 2.5.1


1 Answer


I am in the same situation as you. There seems to be a problem with the current apache-airflow-providers-google library when creating external tables from Parquet files (I am using version 10.4.0): the operator appears to send CSV options to the API even when source_format is PARQUET, which is what triggers the error above.

You can use the table_resource parameter as a temporary workaround.

DAG CODE

source_objects = ["data/userdata1.parquet"]  # no leading slash, or the URI gets a double slash

create_imp_external_table = BigQueryCreateExternalTableOperator(
    task_id="create_imp_external_table",
    table_resource={
        "tableReference": {
            "projectId": "my-project",
            "datasetId": "my_dataset",
            "tableId": "parquet_table",
        },
        "externalDataConfiguration": {
            # table_resource follows the BigQuery API representation,
            # so field names are camelCase here
            "sourceUris": [f"gs://my-bucket/{source_object}" for source_object in source_objects],
            "sourceFormat": "PARQUET",
            "autodetect": True,
        },
    },
)
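
If you would rather not build the table resource by hand, another option is to issue a CREATE EXTERNAL TABLE DDL statement through BigQueryInsertJobOperator, which sidesteps the operator's external-configuration handling entirely. This is a minimal sketch (not from the original answer), reusing the same placeholder project, dataset, table, and bucket names as above:

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

create_imp_external_table_ddl = BigQueryInsertJobOperator(
    task_id="create_imp_external_table_ddl",
    configuration={
        "query": {
            # DDL statements require standard SQL
            "query": """
                CREATE OR REPLACE EXTERNAL TABLE `my-project.my_dataset.parquet_table`
                OPTIONS (
                    format = 'PARQUET',
                    uris = ['gs://my-bucket/data/userdata1.parquet']
                )
            """,
            "useLegacySql": False,
        }
    },
)

The DDL route also lets you declare the schema explicitly in the statement instead of relying on autodetection.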