I'm interested in setting up some automated jobs that will periodically export data from our Redshift cluster and store it on S3, where ideally it would then be surfaced back in Redshift via an external table queried through Redshift Spectrum. One thing I'm not sure how to handle is that some of the tables I'm working with change schema over time.
I'm able to UNLOAD data from Redshift to S3 without a problem, and I'm also able to set up an external table within Redshift so that the S3 data is available for querying. However, I'm not sure how best to handle tables whose columns change over time. For example, for certain event data we capture through Segment, a newly added trait produces a new column on the Redshift table that won't have existed in earlier UNLOADs. Within Redshift itself, rows that arrived before the column existed simply show NULL for that column.
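For reference, here's roughly what the current setup looks like. This is just a sketch: the table, bucket, role, and schema names are placeholders, and I'm assuming the external schema has already been created with CREATE EXTERNAL SCHEMA.

```sql
-- Periodic export: dump the events table to S3 as Parquet
UNLOAD ('SELECT * FROM analytics.segment_events')
TO 's3://my-bucket/segment_events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;

-- External table over the unloaded files, queried via Redshift Spectrum
CREATE EXTERNAL TABLE spectrum.segment_events (
    event_id    VARCHAR(64),
    user_id     VARCHAR(64),
    event_name  VARCHAR(256),
    received_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-bucket/segment_events/';
```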
What's the best way to deal with this gradual change in data structure over time? If I just add the new fields to our external table definition, will Redshift be able to deal with the fact that these fields don't necessarily exist in the older UNLOADs, or do I need to go some other route?
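Concretely, when Segment adds a new trait I was imagining just adding the column to the external table definition, something like the following (the column name is hypothetical, and I'm assuming ALTER TABLE ... ADD COLUMN is allowed on external tables; if not, I'd drop and recreate the table definition instead):

```sql
-- Hypothetical trait column that only exists in the newer UNLOADs
ALTER TABLE spectrum.segment_events
ADD COLUMN plan_tier VARCHAR(64);
```

The hope is that Spectrum would just return NULL for that column when scanning the older files that don't contain it, but I don't know whether that's actually how it behaves.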