
I have an ETL process that is deleting a couple hundred thousand rows from a table with 18 billion rows, using a unique hashed surrogate key like 1801b08dd8731d35bb561943e708f7e3:

delete from CUSTOMER_CONFORM_PROD.c360.engagement
            where (
                engagement_surrogate_key) in (
                select (engagement_surrogate_key)
                from CUSTOMER_CONFORM_PROD.c360.engagement__dbt_tmp
            );

Each run takes 4 to 6 minutes on a Small warehouse. I added a clustering key on engagement_surrogate_key, but since the key is unique with high cardinality it didn't help. I also enabled the search optimization service, but that didn't help either and the query still scans all partitions. How can I speed up the deletion?
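
For reference, the clustering and search optimization attempts described above correspond roughly to statements like these (a sketch, not the exact commands used; the column-scoped search optimization form assumes an account that supports it):

-- Clustering on a unique, high-cardinality key gives Snowflake no useful
-- micro-partition pruning for this DELETE, which is why it didn't help.
alter table CUSTOMER_CONFORM_PROD.c360.engagement
  cluster by (engagement_surrogate_key);

-- Search optimization targets selective point lookups; the IN-subquery
-- DELETE above still ends up scanning all partitions.
alter table CUSTOMER_CONFORM_PROD.c360.engagement
  add search optimization on equality(engagement_surrogate_key);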

Luis Lema
    Try the USING keyword as documented here https://docs.snowflake.com/en/sql-reference/sql/delete – NickW Feb 18 '23 at 17:45
  • Unfortunately, I can't change the delete statement as it's generated by the dbt incremental strategy. – Luis Lema Feb 18 '23 at 17:56
  • If you can’t change the delete statement then use a larger warehouse – NickW Feb 18 '23 at 18:30
  • I would try to override the SQL being generated, if possible. Also, are the records being deleted from a limited date range for a date-related field in the data? For example, if this is 5 years of data but you only get deletes from within the last 3 months, or something like that? If so, you could cluster by day (or month) and then use that as part of the join condition. Having 2 fields in your join condition may also force dbt to change its SQL statement to a USING for you. – Mike Walton Feb 19 '23 at 22:51
  • Unfortunately, dbt doesn't let you override the statement. The records are currently being deleted without a date range. That's a limitation in dbt, and it would be great if they added the ability to specify a date range so it doesn't scan the whole table. dbt recommends hashing all the columns in the join into a surrogate key, and that's what I am using. Clustering on the surrogate key doesn't help because it's unique and has high cardinality. – Luis Lema Feb 19 '23 at 23:58
  • I just found out dbt had implemented the functionality in December to limit the scan on the destination table by adding predicates (i.e. date range): https://github.com/dbt-labs/dbt-core/pull/5702 – Luis Lema Feb 21 '23 at 21:16
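
As a point of reference for the USING suggestion in the comments above, a rewritten statement could look roughly like this (a sketch only, since dbt generates the actual SQL; the loaded_date join illustrates Mike Walton's two-column idea and assumes the staging table carries that column):

-- Hypothetical rewrite using USING plus a date column in the join condition,
-- so clustering on loaded_date could let Snowflake prune partitions.
delete from CUSTOMER_CONFORM_PROD.c360.engagement
using CUSTOMER_CONFORM_PROD.c360.engagement__dbt_tmp
where engagement.engagement_surrogate_key = engagement__dbt_tmp.engagement_surrogate_key
  and engagement.loaded_date = engagement__dbt_tmp.loaded_date;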

1 Answer


The deletion can be sped up by limiting the scan on the destination table with a date range, for example filtering for only the past month of data: loaded_date>=dateadd(MM, -1, current_date). If you are using dbt, they have implemented that functionality through this macro:

{% macro default__get_incremental_merge_sql(arg_dict) %}

  {% do return(get_merge_sql(arg_dict["target_relation"], arg_dict["temp_relation"], arg_dict["unique_key"], arg_dict["dest_columns"], arg_dict["predicates"])) %}

{% endmacro %}

So you can add the predicate to the dbt incremental model config like this:

{{ config(materialized='incremental', unique_key='engagement_surrogate_key', predicates=['loaded_date>=dateadd(MM, -1, current_date)']) }}
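
For context, here is a minimal sketch of how that config might sit in the incremental model file, using the same predicates key shown above (the upstream ref name and the select are hypothetical; the is_incremental() filter mirrors the same one-month window):

{{ config(
    materialized='incremental',
    unique_key='engagement_surrogate_key',
    predicates=['loaded_date>=dateadd(MM, -1, current_date)']
) }}

select *
from {{ ref('stg_engagement') }}  -- hypothetical upstream model
{% if is_incremental() %}
  -- only pick up recent rows on incremental runs, matching the predicate
  -- that limits the scan on the destination table
  where loaded_date >= dateadd(MM, -1, current_date)
{% endif %}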

When you run your model, the code generated will be this:

delete from CUSTOMER_CONFORM_PROD.c360.engagement
        where (
            engagement_surrogate_key) in (
            select (engagement_surrogate_key)
            from CUSTOMER_CONFORM_PROD.c360.engagement__dbt_tmp
        )
                and loaded_date>=dateadd(MM, -1, current_date);
            
Luis Lema