
I am trying to create an external table on top of a Google Cloud Storage bucket through a Composer DAG. My upstream provides me partitioned Parquet files based on country, so I would like to create the external table with source data partitioning enabled.

Sample GCS Path - gs://my-gs-bucket/folder1/subfolder/country=US/ (has multiple parquet files)

While creating the external table through the console, the following options are provided:

Source:

Create table from -> Google Cloud Storage

Select file from GCS bucket or use a URI pattern -> gs://my-gs-bucket/folder1/subfolder/country*

File Format -> Parquet

Source Data Partitioning - Enabled

Select Source URI Prefix - gs://my-gs-bucket/folder1/subfolder

I would like to do this through a Composer DAG. I was able to create an external table on the GCS files using BigQueryCreateExternalTableOperator, but without source data partitioning. Any idea how to enable Source Data Partitioning and set the Source URI Prefix through the DAG?

Sri Bharath

1 Answer


There is no parameter to create partitioned external tables in BigQueryCreateExternalTableOperator.

For your requirement to create a partitioned table, you can consider the sample DAG below, which uses BigQueryCreateEmptyDatasetOperator and BigQueryCreateEmptyTableOperator to create a dataset and a partitioned table with the time_partitioning parameter in BigQuery.

import os

from airflow import models
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
    BigQueryCreateEmptyTableOperator,
)
from airflow.utils.dates import days_ago

PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "your-project-id")
DATASET_NAME = os.environ.get("GCP_BIGQUERY_DATASET_NAME", "testdataset")

TABLE_NAME = "partitioned_table"

SCHEMA = [
    {"name": "value", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "ds", "type": "DATE", "mode": "NULLABLE"},
]

dag_id = "example_bigquery"


with models.DAG(
    dag_id,
    schedule_interval="@hourly",  # Override to match your needs
    start_date=days_ago(1),
    tags=["example"],
    user_defined_macros={"DATASET": DATASET_NAME, "TABLE": TABLE_NAME},
    default_args={"project_id": PROJECT_ID},
) as dag:
    create_dataset = BigQueryCreateEmptyDatasetOperator(
        task_id="create-dataset", dataset_id=DATASET_NAME, project_id=PROJECT_ID
    )
    create_table = BigQueryCreateEmptyTableOperator(
        task_id="create_table",
        dataset_id=DATASET_NAME,
        table_id=TABLE_NAME,
        schema_fields=SCHEMA,
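        # Partition the table by day on the "ds" DATE column from SCHEMA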
        time_partitioning={
            "type": "DAY",
            "field": "ds",
        },
    )
    
    create_dataset >> create_table 

You can also load the schema from GCS. Example (with the schema JSON stored in GCS):

CreateTable = BigQueryCreateEmptyTableOperator(
    task_id='BigQueryCreateEmptyTableOperator_task',
    dataset_id='ODS',
    table_id='Employees',
    project_id='internal-gcp-project',
    gcs_schema_object='gs://schema-bucket/employee_schema.json',
    gcp_conn_id='airflow-service-account'
)

You can refer to this Stack Overflow link for creating an external table on Parquet files using BigQueryCreateExternalTableOperator.
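For reference, a minimal sketch of that operator for Parquet files could look like the following (the bucket and object path are taken from your question; the project, dataset, and table names are placeholders). Note that this creates the external table without source data partitioning:

from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateExternalTableOperator,
)

create_external_table = BigQueryCreateExternalTableOperator(
    task_id="create_external_table",
    bucket="my-gs-bucket",                          # GCS bucket holding the Parquet files
    source_objects=["folder1/subfolder/country*"],  # object pattern inside the bucket (single wildcard)
    destination_project_dataset_table="your-project-id.testdataset.external_table",  # placeholder names
    source_format="PARQUET",                        # Parquet files carry their own schema
)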

Prajna Rai T