0

I would like to include a condition in my Snowflake task to run only if a specified table has data in it. This would be similar to task condition:

WHEN SYSTEM$STREAM_HAS_DATA('my_schema.my_table')

Except I do not wish to use a stream. The problem with using a stream in some cases, is that streams can go stale. I have tables in my ELT process that may not receive updates for weeks or months. Possibly even years.

One thought I had was to use a UDF in the task condition:

WHEN PUBLIC.TABLE_HAS_DATA('my_schema.my_table')

This would be great if I could throw a SELECT CAST(COUNT(1) AS BOOLEAN) FROM "my_schema"."my_table" in there. But a SQL UDF will not be able to do anything with a table name that is passed as a parameter. And a Javascript UDF seems too limiting when it comes to querying tables.

Admittedly, I am not a Javascript programmer. Nor am I too familiar with Snowflake's Javascript UDF abilities. I can perform the desired queries in a Javascript Stored Proc just fine. But those don't seem to translate over to UDFs.

Paul M
  • 351
  • 3
  • 14

3 Answers3

0

Snowflake Streams should only go stale if you don't do something with the data within its set retention period. As long as you have a task to process data in the stream (change records) when they show up you should be fine. So if you don't see a change show up in a Stream for 6 months, that's fine as long as you process that change record within your data retention period (14 days as an example).

If your task has a STREAM_HAS_DATA condition and the stream doesn't get data for 14 days, the stream will go stale because a stream's offset is only updated when it is queried. You can work around this issue by removing the condition and letting the task run more often.

CodeMonkey
  • 413
  • 4
  • 15
  • That's what I thought. However, I've got streams going stale and my ELT tasks are failing as a result. When I dive into the data, there is a fair amount of evidence to suggest that - if the underlying table receives 0 CRUDs for 14 days, the stream goes stale. 14 days is the overriding retention period for infrequently used tables. – Paul M Aug 13 '20 at 17:02
  • Interesting, I don't know that I've run into a scenario where i've had a stream not get data for 14 days. I'll be sure to test this out myself! – CodeMonkey Aug 13 '20 at 18:49
0

The SYSTEM$STREAM_HAS_DATA does only apply to streams https://docs.snowflake.com/en/sql-reference/functions/system_stream_has_data.html.

As streams can get stale, we can check (since Snowflake 5.1.x which was released last Jan 2021) the stale_after timestamp property returned by the SHOW STREAMS command so that we can promptly re-create streams that are about to get stale.

A solution to retrieve stale streams is provided here: Snowflake - How can I query the stream's metadata and save to table

0

We had the same issue in my Org. We have a shared db setup by a provider that built an app in which users have to upload submissions that get sent to snowflake in the backend. The submissions time are infrequent and depend on the org. Also our account data retention is 1 day so that was putting us off from using Streams. However, we setup a general task that extends the date of empty streams like this

CREATE OR REPLACE TASK SYSTEM_EXTEND_STREAM_STALENESS
    USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE  = 'XSMALL'
    SCHEDULE = 'USING CRON 0 0 * * * UTC'
    COMMENT = 'This task is used to increase the staleness date of infrequent data sources to prevent them going stale before new CRUD operations appear. The schedule should match the account data retention setup.'
AS
DECLARE
    RES RESULTSET;
    SOURCE STRING;
BEGIN
    SHOW STREAMS IN ACCOUNT;
    RES := (SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID())) WHERE NOT("stale"));

    FOR R IN RES DO
        SOURCE := CONCAT(R."database_name", '.', R."schema_name", '.', R."name");
        SELECT 1 FROM IDENTIFIER(:SOURCE) LIMIT 1;
        IF (NOT EXISTS(SELECT 1 FROM TABLE(RESULT_SCAN(LAST_QUERY_ID())) LIMIT 1)) THEN
            CREATE OR REPLACE TEMP TABLE STREAM_DUMP AS SELECT * FROM IDENTIFIER(:SOURCE);
        END IF;
    END FOR;
END;

So if your Data Retention is the default 14 days, change the CRON schedule to match 14 days or less. What this does is every period it checks all streams in your account (that are not already stale) and if there's no data in it it does a dump DML operation to increase the staleness date. So now you can use your other tasks with WHEN SYSTEM$STREAM_HAS_DATA safely. I just wish Snowflake would do this out of the box. I get they can't do it for streams that actually have data as it would just increase the data size but for empty streams it shouldn't have to be so difficult.

Andrei Budaes
  • 591
  • 7
  • 22