
I have two MySQL RDS instances hosted on AWS. One is my "production" instance and the other is my "performance" instance. Both instances have the same schema and tables.

Once a year, we take a snapshot of the production RDS and load it into the performance RDS so that our performance environment has data similar to production. This process takes a while, because there is data specific to the performance environment that must be re-added each time we do this mirror.
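For reference, automating the one-time snapshot/restore step could look roughly like this with boto3 (this is only a sketch - the instance identifiers, snapshot name, region, and instance class are placeholders, and re-adding the performance-specific data isn't shown):

```python
# Sketch of automating the yearly snapshot/restore with boto3.
# All identifiers below are placeholders for illustration.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

SOURCE_DB = "production-rds"       # placeholder identifier
TARGET_DB = "performance-rds"      # placeholder identifier
SNAPSHOT_ID = "prod-to-perf-mirror"

# 1. Snapshot the production instance and wait for it to become available.
rds.create_db_snapshot(DBInstanceIdentifier=SOURCE_DB,
                       DBSnapshotIdentifier=SNAPSHOT_ID)
rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=SNAPSHOT_ID)

# 2. Restore the snapshot as a new instance (deleting or renaming the old
#    performance instance would have to happen separately).
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier=TARGET_DB,
    DBSnapshotIdentifier=SNAPSHOT_ID,
    DBInstanceClass="db.r5.large",  # assumption: whatever class you normally use
)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=TARGET_DB)
```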

I'm trying to find a way to automate this process, and to achieve the following:

  1. Do a one time mirror in which all data is copied over from our production database to our performance database.
  2. Continuously (preferably weekly) mirror all new data (but not old data) from our production MySQL RDS to our performance MySQL RDS.
  3. During the continuous mirroring, I'd like for the production data not to overwrite anything already in the performance database. I'd only want new data to be inserted into the performance database.
  4. During the continuous mirroring, I'd like to change some of the data as it goes onto the performance RDS (for instance, I'd like to obfuscate user emails - a sketch of the kind of transform I mean is below this list).
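For point 4, the kind of per-row transform I have in mind is something like this hypothetical helper (the hashing scheme is just an example):

```python
# Hypothetical example of the per-row transform I'd want during mirroring:
# deterministically obfuscate user emails.
import hashlib


def obfuscate_email(email: str) -> str:
    """Replace the local part of an email with a short, stable hash."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode("utf-8")).hexdigest()[:12]
    return f"{digest}@{domain or 'example.com'}"


print(obfuscate_email("jane.doe@example.com"))  # something like '1c6f...@example.com'
```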

The following are the tools I've been researching to assist me with this process:

  1. AWS Database Migration Service seems to be capable of handling a task like this, but the documentation recommends using different tools for homogeneous data migration.
  2. Amazon Kinesis Data Streams also seems able to handle my use case. I could write a "fetcher" program that reads all new data from the prod MySQL binlog and sends it to Kinesis Data Streams, then write a Lambda that transforms the data (and decides what to send/add/obfuscate) and sends it to my destination - either the performance RDS directly or, if I can't do that, a consumer HTTP endpoint I write that updates the performance RDS. (A rough sketch of that Lambda is below.)
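To make option 2 concrete, the consumer Lambda I'm picturing would look roughly like this. It's only a sketch under assumptions: the "fetcher" is assumed to publish one JSON-encoded row per Kinesis record, the table/column names and environment variable names are made up, pymysql is assumed to be packaged with the function, and it reuses the obfuscate_email idea from above.

```python
# Sketch of a Kinesis-triggered Lambda that transforms rows and writes them
# to the performance RDS. Table/column names and env vars are hypothetical.
import base64
import hashlib
import json
import os

import pymysql  # assumption: bundled with the Lambda deployment package


def obfuscate_email(email: str) -> str:
    # Same idea as the helper sketched earlier: hash the local part.
    local, _, domain = email.partition("@")
    return hashlib.sha256(local.encode("utf-8")).hexdigest()[:12] + "@" + (domain or "example.com")


def handler(event, context):
    # Connection details come from environment variables (hypothetical names).
    conn = pymysql.connect(
        host=os.environ["PERF_DB_HOST"],
        user=os.environ["PERF_DB_USER"],
        password=os.environ["PERF_DB_PASSWORD"],
        database=os.environ["PERF_DB_NAME"],
    )
    try:
        with conn.cursor() as cur:
            for record in event["Records"]:
                # Assumption: each Kinesis record holds one JSON-encoded row.
                row = json.loads(base64.b64decode(record["kinesis"]["data"]))
                row["email"] = obfuscate_email(row["email"])  # transform step
                # INSERT IGNORE so rows already in the performance DB are
                # never overwritten (goal 3).
                cur.execute(
                    "INSERT IGNORE INTO users (id, name, email) VALUES (%s, %s, %s)",
                    (row["id"], row["name"], row["email"]),
                )
        conn.commit()
    finally:
        conn.close()
```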

I'm not sure which of these tools to use. DMS seems to be built for migrating heterogeneous data rather than homogeneous data, so I'm not sure it's the right fit. Similarly, it seems like I could build something that works with Kinesis Data Streams, but having to write a custom program that fetches data from MySQL's binlog and another program that consumes from Kinesis makes me feel like Kinesis isn't the best tool for this either.

Which of these tools is best suited to my use case? Or is there another tool that I should be using for this instead?

Brian K
  • Is there a reason you aren't simply using MySQL Replication? – Bill Karwin Jun 06 '22 at 20:36
  • Reading your post more carefully, it makes more sense if you reverse "homogenous" and "heterogenous." I think you might be using these terms incorrectly. – Bill Karwin Jun 06 '22 at 20:43
  • There's no reason why I can't use MySQL Replication, just that I have more experience with AWS tools. Does MySQL Replication work if I only want to replicate over new data, and I want to manipulate the data a little before I actually mirror it over to the performance RDS? – Brian K Jun 06 '22 at 20:47
  • Sorry, I’m using “homogeneous” in the way that the AWS DMS developer guide defines it: data from two separate databases with the same engine and schema. – Brian K Jun 06 '22 at 20:47
  • 1
    MySQL replication can replicate to a table with the same name, same schema, and same columns, but the table may use a different storage engine, different indexes, or different partitioning. I don't know from your description what differences you have made in your "performance" schema. – Bill Karwin Jun 06 '22 at 20:58
  • 1
    Replication works incrementally. It logs insert/update/delete events on the source node, then those logs are downloaded to the replica and replayed there. – Bill Karwin Jun 06 '22 at 21:00
  • All indexes between the prod and perf databases are exactly the same; the only difference is the data itself - a few rows were added to the performance database for test purposes. Okay, it sounds like replication will fit our use case then. I'll start looking into it, thanks – Brian K Jun 06 '22 at 21:02
  • I suggest making sure that you use row-based replication, not statement-based replication. Statement-based replication could accidentally apply updates or deletes to your extra data. There's also a risk of data collisions if you use auto-increment keys and your extra data is in the same tables as replicated tables. The safest thing to do is store the extra data in separate tables that don't exist at all on the source node. – Bill Karwin Jun 06 '22 at 21:31
  • Get Amazon to clone the server for you. They can do that "instantly". – Rick James Jun 06 '22 at 22:39

0 Answers