1

Is there an easy/efficient way to create surrogate keys in Snowflake?

[screenshot of the example data set, including a battery_uuid column]

Imagine this data set is going to be selected into a table, and during the insertion a battery_id column is added, which is the battery_uuid column mapped to a surrogate key. In this specific case that could become e.g. [1,1,2,1,1,2,3]. Sequences in Snowflake do not provide this behavior, because a sequence instead yields e.g. [1,2,3,4,5,6,7]. It is preferable if this mapping is done during the insertion; however, it could also be done after insertion if necessary. As context: this column is added because joins on UUIDs are not performant.

JPlanken
  • 55
  • 1
  • 9
  • Any reason why you don't want it to be unique? By definition a surrogate key is a unique generated value. If you want, you can store that serial_number in a different table and have a unique id stored there. – Himanshu Kandpal Mar 20 '21 at 14:11

3 Answers

1

If your goal is to turn a string into a number for the purposes of join performance, I recommend leveraging a HASH function. In your case, you could simply create a new column and update its values to HASH(battery_uuid) to create a surrogate key, and then leverage that new column for your joins. If your natural key spans multiple columns, this will still work for you, since the HASH function accepts more than a single argument. (Note that HASH returns a 64-bit value, so it is not guaranteed collision-free.)

https://docs.snowflake.com/en/sql-reference/functions/hash.html
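For example, a minimal sketch (the table name events is assumed; the column names come from the question):

```sql
-- Add the surrogate-key column and backfill it from the UUID.
ALTER TABLE events ADD COLUMN battery_id NUMBER;

UPDATE events
SET battery_id = HASH(battery_uuid);

-- The same value can also be computed at insert time:
-- INSERT INTO events SELECT ..., HASH(battery_uuid) FROM staging;
```

Since HASH is deterministic, every row with the same battery_uuid gets the same battery_id, with no lookup table or sequence needed.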

Mike Walton
  • 6,595
  • 2
  • 11
  • 22
1

This is how you can do it

CREATE TABLE thetable (event_id string, batt_id string, serial_number string, batt_key number);
INSERT INTO thetable VALUES('1','bat1','1',0);
INSERT INTO thetable VALUES('2','bat1','1',0);

So that gives a table in the basic form you want, with some dummy rows representing pre-existing data whose existing key we want to reuse, based on batt_id. I shortened the column names for typing reasons and used number strings instead of UUIDs, but the point still holds.

CREATE OR REPLACE SEQUENCE batt_key_seq;

WITH new_data(event_id, batt_id, serial_number) AS (
    SELECT * FROM VALUES 
        ('11','bat1', '10'),
        ('12','bat1', '10'),
        ('13','bat2', '11'),
        ('14','bat1', '10'),
        ('15','bat1', '10'),
        ('16','bat2', '11'),
        ('17','bat3', '12')
), new_data_with_keys AS (
    SELECT d.batt_id
        -- reuse the existing key if the batt_id is already in the table,
        -- otherwise draw a fresh value from the sequence
        ,COALESCE(t.batt_key, batt_key_seq.nextval) AS batt_key
    FROM (
        SELECT DISTINCT batt_id 
        FROM new_data
    ) AS d
    LEFT JOIN thetable AS t 
        ON d.batt_id = t.batt_id
    -- the join can match multiple pre-existing rows per batt_id,
    -- so keep only one of them
    QUALIFY row_number() OVER (PARTITION BY d.batt_id ORDER BY t.event_id) = 1
)
SELECT n.*, k.batt_key
FROM new_data AS n
JOIN new_data_with_keys AS k 
    ON n.batt_id = k.batt_id
;

So the ROW_NUMBER() part is there to capture only one of the existing rows (hence the duplicate insert into the pre-existing data at the start). I feel like this might be cleaner some other way, but it gets the job done.

But in effect you get your keys: you find the existing keys, give new keys to the new data, and can then insert the data in this enriched form.
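Wrapping the same query in an INSERT makes that last step concrete; a minimal sketch using the same table, sequence, and CTE shapes as above (with a smaller VALUES list for brevity):

```sql
INSERT INTO thetable (event_id, batt_id, serial_number, batt_key)
WITH new_data(event_id, batt_id, serial_number) AS (
    SELECT * FROM VALUES
        ('11','bat1','10'),
        ('13','bat2','11')
), new_data_with_keys AS (
    SELECT d.batt_id
        ,COALESCE(t.batt_key, batt_key_seq.nextval) AS batt_key
    FROM (SELECT DISTINCT batt_id FROM new_data) AS d
    LEFT JOIN thetable AS t ON d.batt_id = t.batt_id
    QUALIFY ROW_NUMBER() OVER (PARTITION BY d.batt_id ORDER BY t.event_id) = 1
)
SELECT n.event_id, n.batt_id, n.serial_number, k.batt_key
FROM new_data AS n
JOIN new_data_with_keys AS k ON n.batt_id = k.batt_id;
```

Snowflake allows a WITH clause directly inside INSERT ... SELECT, so the enrichment and the load happen in a single statement.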

What I am noticing in testing is that my key jumps by 4 each run, presumably because nextval is evaluated for every row of the join rather than only for rows where the key is missing, and swapping to

,IFF(t.batt_key is not null, t.batt_key, batt_key_seq.nextval) AS batt_key

did not fix that.

So, separating the key-finding parts like this:

WITH new_data(event_id, batt_id, serial_number) AS (
    SELECT * FROM VALUES 
        ('11','bat1', '10'),
        ('12','bat1', '10'),
        ('13','bat2', '11'),
        ('14','bat1', '10'),
        ('15','bat1', '10'),
        ('16','bat2', '11'),
        ('17','bat3', '12')
), missing_keys AS (
    SELECT d.batt_id
        ,batt_key_seq.nextval AS batt_key
    FROM (
        SELECT DISTINCT batt_id 
        FROM new_data
    ) AS d
    LEFT JOIN thetable AS t 
        ON d.batt_id = t.batt_id
    WHERE t.batt_id IS NULL
), old_keys AS (
    SELECT distinct d.batt_id
        ,t.batt_key
    FROM (
        SELECT DISTINCT batt_id 
        FROM new_data
    ) AS d
    LEFT JOIN thetable AS t 
        ON d.batt_id = t.batt_id
    WHERE t.batt_id IS NOT NULL
), new_data_with_keys AS (
    SELECT * FROM missing_keys
    UNION ALL
    SELECT * FROM old_keys
)
SELECT n.*, k.batt_key
FROM new_data AS n
JOIN new_data_with_keys AS k 
    ON n.batt_id = k.batt_id
;

now the new keys are no longer skipped/wasted.

The performance difference of the double join versus key-space wastage is something you might need to test, and/or you should understand how much new versus existing data you will see and thus how much wastage happens. If the performance of the second form is vastly different, you can always re-key the data later, although if these keys are used outside the table that can be a lot of data to update later.

Simeon Pilgrim
  • 22,906
  • 3
  • 32
  • 45
0

You can use a sequence in Snowflake. This is one way to implement surrogate keys: create the sequence in your schema, and you can use it against any table under that schema. E.g. below:

https://docs.snowflake.com/en/sql-reference/sql/create-sequence.html#examples
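A minimal sketch of this approach (the table and sequence names are illustrative, not from the question):

```sql
-- One sequence can serve any table in the schema.
CREATE OR REPLACE SEQUENCE battery_seq;

CREATE TABLE batteries (
    battery_id NUMBER DEFAULT battery_seq.nextval,
    battery_uuid STRING
);

-- Each inserted row automatically receives the next sequence value.
INSERT INTO batteries (battery_uuid) VALUES ('uuid-1'), ('uuid-2');
```

As the question notes, a plain sequence yields a distinct value per row, so to reuse the same key per battery_uuid you would keep this uuid-to-key mapping in its own dimension table and join against it when loading the fact data.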