Replication pipeline to replicate data from MySql RDS to Redshift

Question

My problem is here to create a replication pipeline that replicates tables and data from MySql RDS to Redshift and I cannot use any managed service. Also, any new updates in RDS should be replicated in the redshift tables as well.

After looking at my many solutions, I came to an understanding of the following steps:

Create flat files/CSVs from MySql RDS and save them in S3.
Use Redshift's COPY command to copy data in staging tables and then finally save it to the main tables.
Now, for the update part, every time I will push the CSVs to S3 and step 2 will be repeated.

So, I just wanted to confirm if the above approach is fine? As, every time when an update happens, will the old data be deleted completely and replaced by the new or is it possible to just update the necessary records. If yes, then how?

Any help will be really appreciated. Thanks in advance.

score 1 · Answer 1 · answered Apr 13 '20 at 05:05

Yes, above strategy is not just fine, its good. I use it in production system and it works great, though you have to careful and craft this strategy to make sure that it solves your use case effectively and efficiently.

Here is few points, what I mean by effectively and efficiently.

Make sure you have most efficient way to identify the records to be pushed to Redshift, meaning identify the potential records with optimized queries that includes CPU, Memory.
Make sure to use optimized way to send the identified to redshift that includes data size optimization, so that it uses minimum storage and network bandwidth. e.g. compress and gzip CSV files, so that it takes minimum size in S3 storage and save network bandwidth.
Try to run copy redshift queries in a way that it executes in parallel.

Hope this will help.

Replication pipeline to replicate data from MySql RDS to Redshift

1 Answers1