7

I'm trying to select the latest partition from a BigQuery table without scanning the whole table in a DBT model in order to save query costs.

DBT doesnt allow using semicolons in a data model so using the DECLARE+SET scripting statements doesn't work as suggested here.

DBT has a sql_header macro which allows setting some variables in the header but that header doesn't accept references to a data model or at least the following code is not compiling:

{{ config(
  sql_header="  DECLARE latest_partition_date DATE;
  DECLARE latest_load_timestamp TIMESTAMP;
  SET latest_partition_date = (SELECT MAX(_PARTITIONDATE) FROM {{ ref("model") }} );
  SET latest_load_timestamp = (SELECT MAX(loaded_at) FROM {{ ref("model") }} WHERE _PARTITIONDATE = latest_partition_date);"
) }}

-- set the main query
SELECT * FROM {{ ref("model") }}
WHERE 
-- Select the latest partition to reduce 'Bytes processed' for loading the query.
_PARTITIONDATE = latest_partition_date
-- Select the latest load within the latest partition to get only one duplicate of data.
AND loaded_at = latest_load_timestamp

I need to solve this in standard SQL.

Other methods that were suggested included setting WHERE _PARTITIONDATE = CURRENT_DATE() or using DATE_SUB(CURRENT_DATE(), 3) but those don't satisfy because data load breakages are unpredictable and only dynamically selecting the latest would work here. Is that possible?

Edgar
  • 123
  • 1
  • 5

3 Answers3

2

You could do it in another query and get the result as a variable, something like this:

    {%- call statement('max_partition', fetch_result=True) -%}
      SELECT MAX(_PARTITIONDATE) FROM {{ ref("model") }} )
    {%- endcall -%}

    {%- set max_date = load_result('max_partition')['data'][0][0] -%}

    SELECT * FROM {{ ref("model") }}
    WHERE 
    _PARTITIONDATE = {{ max_date }}
Javier Montón
  • 4,601
  • 3
  • 21
  • 29
  • When I run this i get the error message: `No matching signature for operator = for argument types: DATE, INT64. Supported signature: ANY = ANY`. It's probably because `_PARTITIONDATE` is type `DATE` but it seems to be evaluated as and INT. Any idea how to fix that? – Edgar Feb 19 '21 at 15:00
  • Then probably you need to cast it in order to retrieve the value, `SELECT CAST(DATE(MAX(_PARTITIONDATE)) AS STRING) FROM {{ ref("model") }}` probably will work to retrieve it, maybe you'll need to cast it again to use it in the main query. I have a similar example working using a `Timestamp` column, but I haven't tried with `_PARTITIONDATE` – Javier Montón Feb 19 '21 at 15:34
  • 1
    I tried casting but at compilation bgiquery just ignored that and still continued forcing the variable to be integer. the solution I got working required conversion within jinja – Edgar Feb 22 '21 at 16:06
2

Since the original question was working with dates, the right datatype conversion was the missing piece.

In the end I figured that the conversion to the right datatype needs to be done within jinja and not with SQL for the queries to accept right variables. Also, {{ max_date }} needed quotes.

The final solution that I got working was this:


{%- call statement('max_partition_date_query', True) -%}
  SELECT MAX(_PARTITIONDATE) as max_partition_date FROM {{ ref('model') }}
{%- endcall -%}

{%- set max_timestamp = load_result('max_partition_date_query')['data'][0][0] -%}
{%- set max_date = max_timestamp.strftime('%Y-%m-%d') -%}

select * FROM {{ ref('model') }}
WHERE _PARTITIONDATE = '{{ max_date }}'
Edgar
  • 123
  • 1
  • 5
1

The approaches above will work for https://cloud.google.com/bigquery/docs/partitioned-tables#ingestion_time but not for https://cloud.google.com/bigquery/docs/partitioned-tables#date_timestamp_partitioned_tables.

The other option is to utilise https://cloud.google.com/bigquery/docs/information-schema-tables to get the latest partition for the date/timestamp partitioned table.

{% set last_snapshot_at_query %}
SELECT 
    PARSE_TIMESTAMP("%Y%m%d", MAX(PARTITION_ID)) 
FROM
    {{ source("change-me-source", "INFORMATION_SCHEMA.PARTITIONS") }} 
WHERE
    table_name = "change-me-table"
{% endset %}

{% if execute %}
{% set last_snapshot_at = run_query(last_snapshot_at_query).columns[0][0] %}
{% endif %}

SELECT
    *
FROM
    {{ source('change-me-source', 'change-me-table') }},
WHERE 
    snapshot_at = "{{ last_snapshot_at }}"