
I have just started a project with some fairly daunting requirements. Company A uses an application that writes records to Company B. The task is to move/update/sync a very large amount of data (195 tables and 2,500 GB of data, millions and millions of rows) from an AWS RDS SQL Server instance (Company B), which I believe is SQL Server 2017, to an on-premises instance (Company A) running SQL Server 2016. The acceptable latency threshold is <= 5 minutes. We only have read access to the source and cannot install anything there, so the traditional means of replication are not available.

There is one central table, which we'll call Table A, that has a primary key defined (TableAID). The rest of the tables have a foreign key relationship to Table A, and they also have their own primary keys defined, with other relationships between them. The "gotcha" in all of this is that when there is an update to the source, all data is dropped and reinserted, thereby creating all new primary keys, with the exception of TableAID. So TableAID is the only primary key that persists and can be counted on. The other tables will still maintain their relationships, but with different primary keys after an update. This makes updating the target with deltas very difficult, given all of the one-to-many relationships. In addition, Company B will archive data from time to time, and we at Company A will have to sync the existing data while retaining the data that was archived and is no longer part of the data stream.
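To make the key problem concrete, here is a minimal sketch of the kind of schema involved; every table and column name except TableAID is a hypothetical stand-in:

```sql
-- Minimal sketch of the schema shape described above (hypothetical names).
-- TableAID is the only key that survives a source-side rewrite; the child
-- table's own primary key (TableBID) is regenerated on every rewrite.
CREATE TABLE dbo.TableA (
    TableAID  INT           NOT NULL PRIMARY KEY,  -- persistent, reliable key
    SomeValue NVARCHAR(100) NULL
);

CREATE TABLE dbo.TableB (
    TableBID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,      -- new values after every drop/reinsert
    TableAID INT NOT NULL REFERENCES dbo.TableA (TableAID),  -- the only stable link back
    Detail   NVARCHAR(100) NULL
);
```

Any delta logic keyed on TableBID is therefore useless after a source update; TableAID is the only durable handle on a group of related rows.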

We explored using SSIS for this but can't get anywhere close to the latency expectations. After some digging in other forum topics I ran across a recommendation to use MS StreamInsight. I am not familiar with this, but if it will work as a means of real-time replication, I can get up to speed. I am not tied to any particular technology, but having said that, my background is with the MS toolset. Before I invest tons of time with StreamInsight, I would like to get an idea if this is a viable solution to my problem. Any other recommendations are also welcome!
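For context on what I've tried to reason through so far: the only read-only delta-detection idea I can see (independent of StreamInsight) is to poll the source and fingerprint each TableAID group, along these lines. This is just a sketch against the hypothetical tables above, and CHECKSUM/CHECKSUM_AGG are weak hashes, so please treat it as an idea rather than a vetted design:

```sql
-- Sketch: fingerprint each TableAID group on the source using only SELECT
-- access. The volatile surrogate key (TableBID) is deliberately excluded,
-- since it changes on every rewrite even when the content is identical.
SELECT a.TableAID,
       CHECKSUM_AGG(CHECKSUM(b.TableAID, b.Detail)) AS GroupFingerprint
FROM dbo.TableA AS a
JOIN dbo.TableB AS b
  ON b.TableAID = a.TableAID
GROUP BY a.TableAID;

-- Persist the fingerprints on the target. On the next poll, any TableAID
-- whose fingerprint is new or different marks a group that must be
-- re-synced wholesale (a real version would fold in all 195 tables).
```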

Thanks!

– DaTaBaKeD
  • What do you mean by "latency"? Surely you are not expecting to copy all that data in 5 minutes. Or do you mean you'd like data updates replicated to the other database with a maximum of 5 minutes of lag? Also, when you refer to a "sql instance", are you referring to Microsoft SQL Server? (I presume so, based on the tag.) Also, you say Company A wants to write to Company B, but then you say that the source is Company B? Have you looked at [AWS Database Migration Service](https://docs.aws.amazon.com/dms/latest/userguide/Welcome.html)? – John Rotenstein Mar 11 '19 at 23:51
  • To copy 1 TB of data in five minutes you need a link, and hardware on either side, that supports roughly 3.3 GB per second. If you don't have that link, you can't do it. The way this is normally done is that you do an initial "snapshot" that copies all the data and takes a while; after that you trickle changes through using some kind of CDC. – Nick.Mc Mar 12 '19 at 00:02
  • Actually, it's 8.3 GB per second for 2.5 TB. There is no way to achieve that kind of transfer speed and then restore the data. The only thing that fast is detaching storage and attaching it at a new location, but that's not a valid option for AWS to on-premises. – Nicolas Hevia Mar 12 '19 at 00:24
  • Yes, I need the updates replicated with no more than a 5-minute lag. Company A writes data to Company B via an application. When records are updated, the data must be replicated back to Company A, so Company B is the source in this scenario. And to reiterate: when a record is saved via the application at Company A, Company B drops the data and reinserts it instead of doing an update. This process regenerates new surrogate keys for all of the supporting tables of the central table (Table A), which makes delta updates difficult, since the keys the updates would be based on have now changed. – DaTaBaKeD Mar 12 '19 at 05:42
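To make the drop-and-reinsert behaviour described in the comments concrete: on the target, the simplest apply step I can picture mirrors the source's behaviour, scoped to one changed TableAID at a time. Again, this is a sketch with hypothetical names, including the linked-server path:

```sql
-- Sketch: re-sync one changed group on the target by mirroring the source's
-- drop-and-reinsert, scoped to a single TableAID. Groups absent from the
-- current feed (archived at the source) are never touched, so archived
-- data is retained on the target.
DECLARE @ChangedTableAID INT = 42;  -- hypothetical changed group

BEGIN TRANSACTION;

DELETE FROM dbo.TableB
WHERE TableAID = @ChangedTableAID;

INSERT INTO dbo.TableB (TableAID, Detail)
SELECT s.TableAID, s.Detail
FROM SourceServer.SourceDb.dbo.TableB AS s  -- hypothetical linked-server path
WHERE s.TableAID = @ChangedTableAID;

-- (dbo.TableA itself would be upserted the same way, keyed on TableAID.)
COMMIT TRANSACTION;
```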

0 Answers