20

Just testing out AWS Redshift, and having discovered some dupe data from an insert that I'd hoped would simply fail on duplication in the key column, reading the docs reveals that primary key constraints aren't "enforced".

Has anyone figured out how to prevent duplication on a primary key (per the "traditional" expectation)?

Thanks to any Redshift pioneers!

altermativ
Saeven
  • I think whatever code you're using to do the posting will need to first run a query to verify that a record with that key doesn't already exist. – Austin Mullins Mar 02 '13 at 04:30
  • I'm hopeful to avoid that. An easy circumstance to consider is a time dimension that's granular to the hour. I was hopeful to only insert rows that might get used on a dimensional query, but if I have to select and then insert for every possible hour that the data spans when those hours are being used by millions of rows - it's crazy. The fact table has a time-key, yyyy_mm_dd_hh - and this key links to the dimension table which expands the data (year, month, day, day of week, week, business quarter, etc.) Sure hope there's an alt... – Saeven Mar 02 '13 at 04:57
  • Saeven, did you ever figure out a solution? – mszaro Apr 03 '13 at 00:51
  • I resorted to using a MySQL table on a different connection as an intermediary, weighing a set of factors with it before adding data into the dimension tables that were an issue. Not optimal from a cleanliness point of view, but I had no choice given the business factors that influenced the decision. – Saeven Apr 03 '13 at 02:30
  • Damn, I was hoping to hear you'd come up with a pure AWS workaround. I may end up having to do something like that too. :( – mszaro Apr 11 '13 at 21:47

6 Answers

9

I assign UUIDs when the records are created. If the records are inherently unique, I use type 4 UUIDs (random); when they aren't, I use type 5 (SHA-1 hash) with the natural keys as input.
Then you can follow these instructions from AWS to perform UPSERTs quite easily. If your input has duplicates, you should be able to clean up your staging table by issuing SQL that looks something like this:

CREATE TABLE cleaned AS
SELECT
  pk_field,
  field_1,
  field_2,
  ...
FROM (
  SELECT
    -- Number the rows within each pk_field so duplicates can be filtered out
    ROW_NUMBER() OVER (PARTITION BY pk_field ORDER BY pk_field) AS r,
    t.*
  FROM table1 t
) x
WHERE x.r = 1;
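
The merge itself then boils down to a delete-then-insert from the staging table. A minimal sketch (the table names events and events_staging are placeholders, not from the question):

-- Delete-then-insert "upsert" from a staging table into the target.
-- Assumes both tables share the same columns and pk_field is the key.
BEGIN;

-- Remove target rows that are about to be replaced by staged rows.
DELETE FROM events
USING events_staging s
WHERE events.pk_field = s.pk_field;

-- Insert the (already de-duplicated) staged rows.
INSERT INTO events
SELECT * FROM events_staging;

COMMIT;

-- Optionally clear the staging table for the next load.
TRUNCATE events_staging;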
Enno Shioji
  • This may be an interesting way to recover salience to optimize query speed. Thanks for sharing, Enno. I'd never thought of adding an intermediate maintenance step -- I don't know what kind of savings this'd produce on very large data sets. Easy enough to automate. – Saeven Jul 08 '13 at 16:39
  • Hi @Saeven! Did you use this approach in production? Can you share your feedback, or have you found any better alternative? Thanks – BusinessAlchemist May 23 '19 at 10:34
6

If it is too late to add an identity column to use as a rowid (ALTER won't allow you to add an IDENTITY column in Redshift), you can do this:

  • Fetch all dupe rows to a temporary table (use DISTINCT to get rid of dupes)
  • Delete these rows from the main table
  • Reinsert rows to the main table

Here's a sample (let's assume id is your key to check dupes against, and data_table is your table):

-- Step 1: collect the key values that appear more than once
CREATE TEMP TABLE delete_dupe_row_list AS
    SELECT t.id FROM data_table t WHERE t.id IS NOT NULL GROUP BY t.id HAVING COUNT(t.id)>1;
-- Step 2: keep a single copy of each duplicated row
CREATE TEMP TABLE delete_dupe_rows AS
    SELECT DISTINCT d.* FROM data_table d JOIN delete_dupe_row_list l ON l.id=d.id;
-- Step 3: delete the duplicated rows and re-insert the single copies
START TRANSACTION;
DELETE FROM data_table USING delete_dupe_row_list l WHERE l.id=data_table.id;
INSERT INTO data_table SELECT * FROM delete_dupe_rows;
COMMIT;
DROP TABLE delete_dupe_rows;
DROP TABLE delete_dupe_row_list;
Kemal
  • 1
    This appears like it should work (hence I'm guessing up-votes from people who didn't try it or used it on databases other than Redshift), but unfortunately it doesn't because the DISTINCT in the statement SELECT DISTINCT d.* FROM data_table d JOIN delete_dupe_row_list l ON l.id=d.id; isn't honoured by Redshift - the result contains all the duplicates in the original table, which means that they all go into the delete_dupe_rows table and get re-inserted later on. – urchino Oct 28 '15 at 12:19
2

Confirmed, they don't enforce it:

Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.

For example, the query planner uses primary and foreign keys in certain statistical computations, to infer uniqueness and referential relationships that affect subquery decorrelation techniques, to order large numbers of joins, and to eliminate redundant joins.

The planner leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand, you should always declare primary and foreign keys and uniqueness constraints when you know that they are valid.

Amazon Redshift does enforce NOT NULL column constraints.

http://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
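
To see what "informational only" means in practice, here is a quick sketch (hypothetical table and key, echoing the hourly time key from the question): the constraint is accepted, but a duplicate insert still succeeds.

-- The PRIMARY KEY declaration is accepted but not enforced; NOT NULL is enforced.
CREATE TABLE dim_time (
    time_key CHAR(13) PRIMARY KEY,   -- e.g. 'yyyy_mm_dd_hh'; informational / planner hint only
    hour_of_day SMALLINT NOT NULL
);

INSERT INTO dim_time VALUES ('2013_03_02_04', 4);
INSERT INTO dim_time VALUES ('2013_03_02_04', 4);  -- no error: the duplicate key is not rejected

SELECT COUNT(*) FROM dim_time;  -- returns 2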

tokenizer_fsj
1

Yeah, you cannot do that. For the time being, I think you should just insert the duplicate data (basically duplicate keys) with an extra timestamp column. That way the table holds every version of a particular row, since an update is also an insert, and when you query Redshift you make sure you pick the latest one.
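
A sketch of the read side of that approach (hypothetical table and column names): every version is kept with a load timestamp, and queries take only the newest row per key.

-- Keep all versions; at query time pick only the most recent row per key.
SELECT pk_field, field_1, field_2
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (PARTITION BY pk_field ORDER BY loaded_at DESC) AS rn
  FROM versioned_table t
) v
WHERE v.rn = 1;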

GG.
1

A quick and dirty way is to use GROUP BY:

SELECT MAX(<column_a>), MAX(<column_b>), <pk_column1>, <pk_column2>
FROM <table_name>
GROUP BY <pk_column1>, <pk_column2>
Alex
  • 1
    With the sheer mass of this data set, tricking the data on the way out isn't an option -- the question is really centered on keeping it clean on the way in. Thanks though... appreciate the square-peg-round-hole attempt. – Saeven Mar 13 '13 at 02:55
-1

I'm using IDENTITY to auto-increment my primary key.

Here is a question I asked on the AWS forums:

https://forums.aws.amazon.com/message.jspa?messageID=450157#450157
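
Declaring it looks something like this (hypothetical table; note that IDENTITY gives each row a unique surrogate value but still won't reject duplicate natural keys):

-- The surrogate id is generated by Redshift: IDENTITY(seed, step).
CREATE TABLE fact_events (
    id BIGINT IDENTITY(1, 1),
    event_name VARCHAR(100),
    event_time TIMESTAMP
);

-- Omit the IDENTITY column in inserts; Redshift fills it in.
INSERT INTO fact_events (event_name, event_time)
VALUES ('signup', '2013-03-02 04:00:00');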

Kirill Fuchs