I'm currently warehousing some Google Search Console data into BigQuery.
Since GSC produces a lot of data, I want to use incremental models that process one day at a time. My models look something like this:
with
raw_data as (
    select * from {{ source('searchconsole', 'searchconsole') }}
    {%- if is_incremental() -%}
    -- this filter will only be applied on an incremental run
    where data_date > (select max(data_date) from {{ this }})
    {%- endif -%}
),
further_transformations as (
    select * from ...
)
select * from further_transformations
And their config then looks like this:
dbt_project.yml:
(...)
models:
  project_name:
    +materialized: incremental
    +incremental_strategy: insert_overwrite
Local properties.yml:
version: 2

models:
  - name: model_xy
    description: Description for model xy
    config:
      partition_by:
        field: 'data_date'
        data_type: 'date'
        granularity: 'day'
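In case it's relevant: as far as I understand, this split config should be equivalent to putting everything inline in the model file, roughly like this (same settings, just written as a config block):

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            'field': 'data_date',
            'data_type': 'date',
            'granularity': 'day'
        }
    )
}}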
The dbt output shows that an incremental run processes data on the order of tens of GB:
12:35:32 1 of 6 START sql incremental model model_xy .......... [RUN]
12:36:13 1 of 6 OK created sql incremental model model_xy ..... [SCRIPT (32.0 GB processed) in 40.68s]
This seems like way too much for a run that, in the end, should add no rows (if I run it right now). There also seems to be no significant difference between a full-refresh run and an incremental run; the incremental run even takes longer. From my understanding of incremental models, this is exactly how they should be built, so am I missing something? Is there a problem with filtering the data this early, inside a CTE?
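For reference, this is roughly what I expect the filter in the raw_data CTE to render to on an incremental run (the project/dataset/table names here are just placeholders for my actual ones):

select * from `my-project.searchconsole.searchconsole`
-- this filter will only be applied on an incremental run
where data_date > (select max(data_date) from `my-project.my_dataset.model_xy`)

Given that, I would expect the incremental run to scan far less data than a full refresh, not roughly the same amount.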
I've looked for people with similar problems, but it seems like people are generally unsure how to use incremental models.