I have a Data Pipeline that exports data from a local DB to Redshift via S3 (very similar to the "Incremental copy of RDS MySQL table to Redshift" template). I have defined a primary key and set insertMode to "OVERWRITE_EXISTING" in the pipeline definition; however, I noticed that some rows eventually got duplicated. In what cases does this happen, and how do I prevent it?
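For context, the relevant parts of the pipeline definition look roughly like this (a trimmed sketch, not the full definition: node IDs, the table name, and the key column are placeholders, and most required fields are omitted):

```json
{
  "objects": [
    {
      "id": "DestRedshiftTable",
      "type": "RedshiftDataNode",
      "tableName": "my_table",
      "primaryKeys": ["id"]
    },
    {
      "id": "RedshiftLoadActivity",
      "type": "RedshiftCopyActivity",
      "insertMode": "OVERWRITE_EXISTING",
      "input": { "ref": "S3StagingDataNode" },
      "output": { "ref": "DestRedshiftTable" }
    }
  ]
}
```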
2 Answers
Redshift does not enforce primary keys, so it will not restrict duplicate values. What we do is load the incremental data into a temporary staging table and then upsert it into the target table (merge-style), checking whether each record already exists. This way you avoid duplicates; a sketch of the pattern is below.
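A minimal sketch of that staging-table upsert, assuming a single-column key called `id`; the table names, S3 path, and IAM role are placeholders:

```sql
BEGIN TRANSACTION;

-- Staging table with the same columns as the target
CREATE TEMP TABLE stage (LIKE target_table);

-- Load the incremental extract from S3 into the staging table
COPY stage
FROM 's3://my-bucket/incremental/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;

-- Delete target rows that are about to be replaced (match on the business key)
DELETE FROM target_table
USING stage
WHERE target_table.id = stage.id;

-- Insert the new versions plus any brand-new rows
INSERT INTO target_table
SELECT * FROM stage;

END TRANSACTION;
```

Running the DELETE and INSERT inside one transaction is the usual way to emulate a merge in Redshift, since duplicate keys are never rejected by the table itself.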
Thanks!!

Chennakrishna
There was ultimately a reason for what I observed, but you're right: since Redshift does not enforce primary keys, staging the data and doing the upsert manually is the way to control changes. – yana Feb 05 '21 at 22:15
Just found this post after several years, adding an answer in case it helps someone else:
In addition to the primary key, Redshift also uses the distkey to determine which rows to overwrite. So in my case an updated value in the distkey column caused Redshift to create a duplicate row, although the primary key remained unchanged. The sketch below illustrates the effect.
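As an illustration of the effect only (not the exact statements the pipeline runs), assume a target table whose distkey is a different column than the primary key, a staging table `stage` like the one in the other answer, and an overwrite step that matches rows on both keys; all names here are made up:

```sql
-- Hypothetical target table: primary key on id, distribution key on account_id
CREATE TABLE target_table (
    id         BIGINT,
    account_id BIGINT,
    amount     DECIMAL(10,2),
    PRIMARY KEY (id)          -- declared but not enforced by Redshift
)
DISTKEY (account_id);

-- If the overwrite matches on the primary key AND the distkey,
-- an incoming row whose account_id changed upstream no longer
-- matches the old row, so the old row is not deleted...
DELETE FROM target_table
USING stage
WHERE target_table.id = stage.id
  AND target_table.account_id = stage.account_id;

-- ...and the new version is simply appended, leaving two rows with the same id
INSERT INTO target_table
SELECT * FROM stage;
```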

yana
Redshift does not enforce uniqueness constraints (primary key or otherwise). See: https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html Also, distkeys are used to assign rows to slices within the cluster and have no uniqueness requirement. – Bill Weiner Feb 06 '21 at 16:16
It's certainly true, but the important (and not so obvious) thing for me was that `insertMode: "OVERWRITE_EXISTING"` in the pipeline definition resulted in Redshift actually looking at the values in the distkey field to decide whether rows with the same primary key would be overwritten or appended. – yana Feb 13 '21 at 22:58