
I'm currently warehousing some Google Search Console data in BigQuery.

Since GSC produces a lot of data, I want to use incremental models with one increment per day. My models look something like this:

with
  raw_data as (
    select * from {{source('searchconsole', 'searchconsole')}}
    {%- if is_incremental() -%}

    -- this filter will only be applied on an incremental run
    where data_date > (select max(data_date) from {{ this }})

    {%- endif -%}
  ),

  further_transformations as (
    select * from ...
  )

select * from further_transformations

And their config then looks like this:

dbt_project.yml:

(...)

models:
  project_name:
    +materialized: incremental
    +incremental_strategy: insert_overwrite

Local properties.yml:

version: 2

models:
  - name: model_xy
    description: Description for model xy
    config:
      partition_by:
        field: 'data_date'
        data_type: 'date'
        granularity: 'day'

The output from dbt states that, on an incremental run, data on the order of tens of GB is processed:

12:35:32  1 of 6 START sql incremental model model_xy .......... [RUN]
12:36:13  1 of 6 OK created sql incremental model model_xy ..... [SCRIPT (32.0 GB processed) in 40.68s]

This seems way too much for a run that, in the end, should add no rows (when I run it right now). There seems to be no significant difference between a full-refresh run and an incremental run; the incremental run even takes longer. From my understanding of incremental models, this is exactly how it should be done, so am I missing something? Is there a problem with filtering the data early like this in CTEs?

I've tried to find people with similar problems, but it seems like people generally aren't sure how to use incremental models.

3 Answers


I believe that is not the way to configure an incremental model.

As far as I know, you have two ways:

  • In the model itself:
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      "field": "data_date",
      "data_type": "date",
      "granularity": "day"
    },
    ...
  )
}}

select ...
  • In your dbt_project.yml:
models:
  (...)
    your_model:
      +incremental_strategy: "insert_overwrite"

I personally always do it at the model level, so the first option is my go-to.

Aleix CC
  • I've edited the question to better show how it is really configured. The config is definitely used, because e.g. when I omit the partitioning, I get some errors. Personally, I like to put all the config in a properties.yml directly next to my models, but I guess that's just preference. Thanks for the answer anyway :) – Steffen Schubert Apr 20 '23 at 08:45

While I'm still confused as to why dbt processes these huge amounts of data, the setup from the question does seem to be correct.

On the other hand, I've found another gotcha that was rather silly on my part:

Since online articles always told me to filter the new increment as early as possible, and my models all depend on each other, I assumed that I could just filter the data in an upstream model and this would also filter the data for the downstream models.

This is of course wrong, because a downstream model still reads the entire upstream table, not just the new partition that was added in the current run.

Implementing the is_incremental() filter in every model therefore improved performance by a mile.
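
For illustration, here is a minimal sketch of a downstream model carrying its own filter; the model and column names are just taken over from the question's pattern, not from my actual project:

-- downstream_model.sql (hypothetical name)
with upstream as (
  select * from {{ ref('model_xy') }}
  {% if is_incremental() %}
  -- filter again here: ref() always points at the full upstream table,
  -- so the upstream model's own incremental filter does not carry over
  where data_date > (select max(data_date) from {{ this }})
  {% endif %}
)

select * from upstream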


A last general tip from my side is to create a model that manages the filter condition for the latest partition, e.g. the maximum date of the partitions in your warehouse. Every other model can then use this model to filter on an incremental run. This helped me in cases where a model doesn't have a column to look up the latest existing partition, e.g. a relation table that consists only of two foreign keys.
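
As a rough sketch of what I mean (all names here are placeholders):

-- incremental_log.sql: a single row holding the latest loaded partition date,
-- read from a fact table that is configured as a source
{{ config(materialized='table') }}

select max(data_date) as max_data_date
from {{ source('warehouse', 'fact_table') }}

A model that has no date column of its own (and therefore nothing to look up in {{ this }}) can then filter its upstream data like this on an incremental run:

select *
from {{ ref('some_upstream_model') }}
{% if is_incremental() %}
where data_date > (select max_data_date from {{ ref('incremental_log') }})
{% endif %}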

To make sure that this model runs before every other model without creating a cycle, I configured one of my fact tables as a source (this might cause an error on the first run, when the fact table doesn't exist yet). Then I referenced the model in my most upstream model within a comment like this:

-- depends_on: {{ ref('incremental_log') }}
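
Spelled out, the source declaration and the placement of that comment could look roughly like this (dataset, source, and model names are placeholders):

version: 2

sources:
  - name: warehouse
    schema: my_dataset  # the dataset your dbt models are built into
    tables:
      - name: fact_table

And at the top of the most upstream model:

-- most_upstream_model.sql
-- depends_on: {{ ref('incremental_log') }}
-- the comment above forces dbt to build incremental_log before this model
-- (and therefore before everything downstream of it)

select * from {{ source('searchconsole', 'searchconsole') }}
...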

While I'm also confused about why dbt seems to scan the whole table instead of just the incremental part as expected, I found a workaround.

First, I realized that changing this part, (select max(data_date) from {{ this }}), to a hard-coded value like '2023-07-21' makes the amount of processed data much smaller, as expected.
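
In other words, only for testing, the filter becomes a literal (everything else unchanged from the question's model):

-- instead of: where data_date > (select max(data_date) from {{ this }})
where data_date > '2023-07-21'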

Based on that, we can use dbt's run_query macro to fetch max(data_date) and a {% set %} block to store the fetched result in a variable. Then we can provide the value as a Jinja variable.

Example:

{{
    config(
        materialized='incremental',
        partition_by={
          "field": "data_date",
          "data_type": "date",
          "granularity": 'day'
        },
    )
}}

{% if is_incremental() %}
{% set max_data_date_query %}
select format_date("%Y-%m-%d", max(data_date)) from {{ this }}
{% endset %}
{% set max_data_date_query_result = run_query(max_data_date_query) %}
{% set max_data_date = max_data_date_query_result.columns[0].values()[0] %}
{% endif %}

with
  raw_data as (
    select * from {{source('searchconsole', 'searchconsole')}}
    {%- if is_incremental() -%}

    -- this filter will only be applied on an incremental run
    where data_date > "{{ max_data_date }}"

    {%- endif -%}
  ),

  further_transformations as (
    select * from ...
  )

select * from further_transformations
Quy Dinh