
After doing a bit of research, I see that since Redshift doesn't support merge/upsert, some people are using staging tables to update/insert records. Since Redshift also doesn't support procedures (triggers, etc.), does anyone have suggestions for how they've automated this process (Lambda, Data Pipeline, etc.)?

1 Answer


I don't think there is one right answer to this. Broadly, there are two basic approaches that I know of:

  1. An explicit step in your ETL that's performed as you're loading data (using this pattern)
  2. A batch cronjob that handles upserts periodically

Option 1 is probably the best approach. AWS Data Pipeline supports this with the RedshiftCopyActivity by setting the insertMode field to OVERWRITE_EXISTING.

At my job we've done this manually in our data pipeline by running the load into a staging table and the upsert within the same SQL transaction.
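For illustration, here is a minimal sketch of that pattern. The table names, S3 path, IAM role, and the assumption that `id` is the unique key are all placeholders:

```sql
BEGIN;

-- Stage the incoming batch in a temp table with the same layout as the target
CREATE TEMP TABLE staging (LIKE target_table);

-- Load the new data from S3 (path and IAM role are placeholders)
COPY staging
FROM 's3://my-bucket/incoming/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV;

-- "Upsert": delete rows that are about to be replaced, then insert the batch
DELETE FROM target_table
USING staging
WHERE target_table.id = staging.id;

INSERT INTO target_table
SELECT * FROM staging;

COMMIT;
```

Because the delete and insert run in one transaction, readers either see the old rows or the new ones, never a partially applied batch.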

The other approach (option 2) is to have your ETL just append new data to a staging table (you can use AWS Firehose for something like this), and then use a scheduled cronjob to merge it into the target table. The upside of this is that your ETL is simpler, but the cronjob becomes another point of failure.
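A minimal sketch of such a scheduled merge, assuming Firehose appends into a hypothetical `staging_appends` table keyed by `id` with a `loaded_at` timestamp (all names and columns are placeholders):

```sql
BEGIN;

-- Remove target rows that will be replaced by newer staged rows
DELETE FROM target_table
USING staging_appends
WHERE target_table.id = staging_appends.id;

-- Insert only the latest staged row per id (the staging table is append-only,
-- so the same key can show up multiple times between runs)
INSERT INTO target_table
SELECT id, col_a, col_b, loaded_at
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY loaded_at DESC) AS rn
    FROM staging_appends
) latest
WHERE rn = 1;

-- Clear the processed rows; DELETE keeps this transactional
-- (TRUNCATE would implicitly commit in Redshift)
DELETE FROM staging_appends;

COMMIT;
```

The cronjob only has to run this script against the cluster on whatever schedule fits your data volume.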

michael_erasmus
  • Thanks for the feedback, I appreciate it. Currently I have a Lambda handling the Redshift upsert that responds to data uploaded to an S3 bucket. It seems to be working fine for now, as there are only a few events per day. I'll definitely look into Data Pipeline as well. Cheers! – maynard1978 Mar 01 '18 at 02:14