I have a Data Pipeline that exports data from a local DB to Redshift via S3 (very similar to the "Incremental copy of RDS MySQL table to Redshift" template). I have defined a primary key and set insertMode to "OVERWRITE_EXISTING" in the pipeline definition; however, I noticed that some rows eventually got duplicated. In what cases does this happen, and how do I prevent it?
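For context, the relevant parts of the pipeline definition look roughly like this (a trimmed sketch, not the full definition: node IDs, the table name, and the key column are placeholders, and most required fields are omitted):

```json
{
  "objects": [
    {
      "id": "DestRedshiftTable",
      "type": "RedshiftDataNode",
      "tableName": "my_table",
      "primaryKeys": ["id"]
    },
    {
      "id": "RedshiftLoadActivity",
      "type": "RedshiftCopyActivity",
      "insertMode": "OVERWRITE_EXISTING",
      "input": { "ref": "S3StagingDataNode" },
      "output": { "ref": "DestRedshiftTable" }
    }
  ]
}
```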
2 Answers
Redshift does not enforce primary keys, so it will not restrict duplicate values. What we do is load the incremental data into a temporary staging table and then upsert it into the target table (merge-style), checking whether each record already exists. This way you avoid duplicates; a sketch of the pattern is below.
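A minimal sketch of that staging-table upsert, assuming a single-column key called `id`; the table names, S3 path, and IAM role are placeholders:

```sql
BEGIN TRANSACTION;

-- Staging table with the same columns as the target
CREATE TEMP TABLE stage (LIKE target_table);

-- Load the incremental extract from S3 into the staging table
COPY stage
FROM 's3://my-bucket/incremental/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;

-- Delete target rows that are about to be replaced (match on the business key)
DELETE FROM target_table
USING stage
WHERE target_table.id = stage.id;

-- Insert the new versions plus any brand-new rows
INSERT INTO target_table
SELECT * FROM stage;

END TRANSACTION;
```

Running the DELETE and INSERT inside one transaction is the usual way to emulate a merge in Redshift, since duplicate keys are never rejected by the table itself.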
Thanks!!

Chennakrishna
There was ultimately a reason for what I observed, but you're right: since Redshift does not enforce primary keys, staging the data and doing the upsert manually is the way to control changes. – yana Feb 05 '21 at 22:15
Just found this post after several years, adding an answer in case it helps someone else:
In addition to the primary key, Redshift also uses the distkey to determine which rows to overwrite. So in my case an updated value in the distkey column caused Redshift to create a duplicate row, although the primary key remained unchanged. The sketch below illustrates the effect.
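As an illustration of the effect only (not the exact statements the pipeline runs), assume a target table whose distkey is a different column than the primary key, a staging table `stage` like the one in the other answer, and an overwrite step that matches rows on both keys; all names here are made up:

```sql
-- Hypothetical target table: primary key on id, distribution key on account_id
CREATE TABLE target_table (
    id         BIGINT,
    account_id BIGINT,
    amount     DECIMAL(10,2),
    PRIMARY KEY (id)          -- declared but not enforced by Redshift
)
DISTKEY (account_id);

-- If the overwrite matches on the primary key AND the distkey,
-- an incoming row whose account_id changed upstream no longer
-- matches the old row, so the old row is not deleted...
DELETE FROM target_table
USING stage
WHERE target_table.id = stage.id
  AND target_table.account_id = stage.account_id;

-- ...and the new version is simply appended, leaving two rows with the same id
INSERT INTO target_table
SELECT * FROM stage;
```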

yana
Redshift does not enforce uniqueness constraints (primary key or otherwise). See: https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html Also, distkeys are used to assign rows to slices within the cluster and have no uniqueness requirement. – Bill Weiner Feb 06 '21 at 16:16
It's certainly true, but the important (and not so obvious) thing for me was that `insertMode: "OVERWRITE_EXISTING"` in the pipeline definition resulted in Redshift actually looking at the values in the distkey field to decide whether rows with the same primary key would be overwritten or appended. – yana Feb 13 '21 at 22:58