20

Just testing out AWS Redshift, and having discovered some dupe data from an insert that I'd hoped would simply fail on duplication in the key column, reading the docs reveals that primary key constraints aren't "enforced".

Has anyone figured out how to prevent duplication on a primary key (per the "traditional" expectation)?

Thanks to any Redshift pioneers!

altermativ
Saeven
  • I think whatever code you're using to do the posting will need to first run a query to verify that a record with that key doesn't already exist. – Austin Mullins Mar 02 '13 at 04:30
  • I'm hopeful to avoid that. An easy circumstance to consider is a time dimension that's granular to the hour. I was hopeful to only insert rows that might get used on a dimensional query, but if I have to select and then insert for every possible hour that the data spans when those hours are being used by millions of rows - it's crazy. The fact table has a time-key, yyyy_mm_dd_hh - and this key links to the dimension table which expands the data (year, month, day, day of week, week, business quarter, etc.) Sure hope there's an alt... – Saeven Mar 02 '13 at 04:57
  • Saeven, did you ever figure out a solution? – mszaro Apr 03 '13 at 00:51
  • I resorted to using a MySQL table on a different connection as an intermediary, weighing a set of factors with it before adding data into the dimension tables that were an issue. Not optimal from a cleanliness point of view, but I had no choice given the business factors that influenced the decision. – Saeven Apr 03 '13 at 02:30
  • Damn, I was hoping to hear you'd come up with a pure AWS workaround. I may end up having to do something like that too. :( – mszaro Apr 11 '13 at 21:47

6 Answers

9

I assign UUIDs when the records are created. If the records are inherently unique, I use type 4 UUIDs (random); when they aren't, I use type 5 (SHA-1 hash) with the natural keys as input.
Then you can follow these instructions from AWS to perform UPSERTs quite easily. If your input has duplicates, you should be able to clean up your staging table by issuing SQL that looks something like this:

CREATE TABLE cleaned AS
SELECT
  pk_field,
  field_1,
  field_2,
  ...
FROM (
  SELECT
    -- Number the rows within each pk_field so duplicates can be filtered out
    ROW_NUMBER() OVER (PARTITION BY pk_field ORDER BY pk_field) AS r,
    t.*
  FROM table1 t
) x
WHERE x.r = 1;
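
The merge itself then boils down to a delete-then-insert from the staging table. A minimal sketch (the table names events and events_staging are placeholders, not from the question):

-- Delete-then-insert "upsert" from a staging table into the target.
-- Assumes both tables share the same columns and pk_field is the key.
BEGIN;

-- Remove target rows that are about to be replaced by staged rows.
DELETE FROM events
USING events_staging s
WHERE events.pk_field = s.pk_field;

-- Insert the (already de-duplicated) staged rows.
INSERT INTO events
SELECT * FROM events_staging;

COMMIT;

-- Optionally clear the staging table for the next load.
TRUNCATE events_staging;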
Enno Shioji
  • This may be an interesting way to recover salience to optimize query speed. Thanks for sharing, Enno. I'd never thought of adding an intermediate maintenance step -- I don't know what kind of savings this'd produce on very large data sets. Easy enough to automate. – Saeven Jul 08 '13 at 16:39
  • Hi @Saeven! Did you use this approach in production? Can you share your feedback, or have you found any better alternative? Thanks – BusinessAlchemist May 23 '19 at 10:34
6

If it is too late to add an identity column to use as a rowid (ALTER won't allow you to add an IDENTITY column in Redshift), you can do this:

  • Fetch all dupe rows to a temporary table (use DISTINCT to get rid of dupes)
  • Delete these rows from the main table
  • Reinsert rows to the main table

Here's a sample (let's assume id is your key to check dupes against, and data_table is your table):

-- Step 1: collect the key values that appear more than once
CREATE TEMP TABLE delete_dupe_row_list AS
    SELECT t.id FROM data_table t WHERE t.id IS NOT NULL GROUP BY t.id HAVING COUNT(t.id)>1;
-- Step 2: keep a single copy of each duplicated row
CREATE TEMP TABLE delete_dupe_rows AS
    SELECT DISTINCT d.* FROM data_table d JOIN delete_dupe_row_list l ON l.id=d.id;
-- Step 3: delete the duplicated rows and re-insert the single copies
START TRANSACTION;
DELETE FROM data_table USING delete_dupe_row_list l WHERE l.id=data_table.id;
INSERT INTO data_table SELECT * FROM delete_dupe_rows;
COMMIT;
DROP TABLE delete_dupe_rows;
DROP TABLE delete_dupe_row_list;
Kemal
  • 1
    This appears like it should work (hence I'm guessing up-votes from people who didn't try it or used it on databases other than Redshift), but unfortunately it doesn't because the DISTINCT in the statement SELECT DISTINCT d.* FROM data_table d JOIN delete_dupe_row_list l ON l.id=d.id; isn't honoured by Redshift - the result contains all the duplicates in the original table, which means that they all go into the delete_dupe_rows table and get re-inserted later on. – urchino Oct 28 '15 at 12:19
2

Confirmed, they don't enforce it:

Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.

For example, the query planner uses primary and foreign keys in certain statistical computations, to infer uniqueness and referential relationships that affect subquery decorrelation techniques, to order large numbers of joins, and to eliminate redundant joins.

The planner leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand, you should always declare primary and foreign keys and uniqueness constraints when you know that they are valid.

Amazon Redshift does enforce NOT NULL column constraints.

http://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
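
To see what "informational only" means in practice, here is a quick sketch (hypothetical table and key, echoing the hourly time key from the question): the constraint is accepted, but a duplicate insert still succeeds.

-- The PRIMARY KEY declaration is accepted but not enforced; NOT NULL is enforced.
CREATE TABLE dim_time (
    time_key CHAR(13) PRIMARY KEY,   -- e.g. 'yyyy_mm_dd_hh'; informational / planner hint only
    hour_of_day SMALLINT NOT NULL
);

INSERT INTO dim_time VALUES ('2013_03_02_04', 4);
INSERT INTO dim_time VALUES ('2013_03_02_04', 4);  -- no error: the duplicate key is not rejected

SELECT COUNT(*) FROM dim_time;  -- returns 2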

tokenizer_fsj
1

Yeah, you cannot do that. For the time being, I think you should just insert the duplicate data (basically duplicate keys) with an extra timestamp column. That way the table holds every version of a particular row, since an update is also an insert, and when you query Redshift you make sure you pick the latest one.
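
A sketch of the read side of that approach (hypothetical table and column names): every version is kept with a load timestamp, and queries take only the newest row per key.

-- Keep all versions; at query time pick only the most recent row per key.
SELECT pk_field, field_1, field_2
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (PARTITION BY pk_field ORDER BY loaded_at DESC) AS rn
  FROM versioned_table t
) v
WHERE v.rn = 1;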

GG.
1

A quick and dirty way is to use GROUP BY:

SELECT MAX(<column_a>), MAX(<column_b>), <pk_column1>, <pk_column2>
FROM <table_name>
GROUP BY <pk_column1>, <pk_column2>
Alex
  • 1
    With the sheer mass of this data set, tricking the data on the way out isn't an option -- the question is really centered on keeping it clean on the way in. Thanks though... appreciate the square-peg-round-hole attempt. – Saeven Mar 13 '13 at 02:55
-1

I'm using IDENTITY to auto-increment my primary key.

Here is a question I asked on the AWS forums:

https://forums.aws.amazon.com/message.jspa?messageID=450157#450157
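
Declaring it looks something like this (hypothetical table; note that IDENTITY gives each row a unique surrogate value but still won't reject duplicate natural keys):

-- The surrogate id is generated by Redshift: IDENTITY(seed, step).
CREATE TABLE fact_events (
    id BIGINT IDENTITY(1, 1),
    event_name VARCHAR(100),
    event_time TIMESTAMP
);

-- Omit the IDENTITY column in inserts; Redshift fills it in.
INSERT INTO fact_events (event_name, event_time)
VALUES ('signup', '2013-03-02 04:00:00');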

Kirill Fuchs