0

I have a simple 9 column report that I'm sideloading into bigquery via the dbt-external-tables module.

version: 2

sources:
  - name: my_schema
    database: my_project
    loader: gcloud storage

    tables:
      - name: my_table_raw
        description: "external table of reports"
        external:
          location: 'gs://my_bucket/my_reports/*'
          auto_refresh: false
          options:
            format: csv
            skip_leading_rows: 1

Everything with this setup is fine so far, data imports correctly, queryable etc.

My simple addition: what is the simplest way to append the file name as a column?

I've attempted something to the effect of:

   - name: file_name_column
       expression: metadata$filename
       data_type: string
       description: "the source file name from within gcp"

But it looks like bigquery / the module is looking to match the number of columns on the table with the number of columns in the files because I am able to "create" the table in bigquery but get all kinds of errors when I query against it.

Update: from what I see here - the "metadata" expression that I've seen and am trying to use above is snowflake specific. Source link.

sgdata
  • 2,543
  • 1
  • 19
  • 44

1 Answers1

3

As I understand it, all BigQuery external tables pointing to Cloud Storage data have an additional pseudo-column _FILE_NAME (docs). There's no need to include it in your external table definition, you can simply query it downstream:

select *,
  _file_name as filename

from {{ source('my_schema', 'my_table_raw') }}

The pseudo-column approach is comparable to Snowflake's metadata$filename and Redshift's $path.

Jeremy Cohen
  • 951
  • 6
  • 7
  • As C3P0 says, "Thank the maker!" – sgdata Feb 11 '21 at 16:07
  • Is there any option to change the csv delimiter as well? – sgdata Feb 11 '21 at 16:53
  • Absolutely! The `dbt-external-tables` YAML spec is really just a passthrough for BigQuery's `create external table` DDL. In the [docs](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_external_table_statement), it looks like `field_delimiter` is supported as one of the `options`. So your YAML `options` dict could include `field_delimiter` alongside `format` and `skip_leading_rows`. – Jeremy Cohen Feb 12 '21 at 10:43