
I’ve been tasked with implementing a system for continuously receiving large amounts of data from a partner. The data schema on the partner’s side is different from ours, so some transformation has to take place when the data is received and imported into our database.

What’s the best approach for solving this task?

I have some control over both sides of the equation, i.e. if we choose to use WCF, our partner will implement that service. But it’s important to limit the amount of programming that has to take place on the partner’s side as much as possible. Ideally they would use some feature of SQL Server to export their data and then let us deal with the rest.

Other relevant factors:

  • SQL Server is used on both sides (2008 on our side, unknown for the partner).
  • .NET 4 and/or anything that comes out of the box from Microsoft, no 3rd party products.
  • Transfer is one way, i.e. from the partner to us only.
  • Amount of data is around tens of thousands of updated records/objects transferred daily.
  • The data model is fairly complicated, with several tables and relations.
  • The solution should be tolerant to changes on the partner's side (we don’t know when and how they will change their model).
  • Maintainability and reliability are more important than performance, i.e. we don't need the latest data, but the service should be easy to modify and it must not fail or stop.

I’m asking this question because I’m overwhelmed by the number of ways this can be accomplished. As a .NET developer, I’m leaning towards implementing WCF or maybe even some custom XML-based exchange. But I know BizTalk would probably fit the bill here, and SQL Server probably has some nifty features built in that I don’t know about.

Any pointers, ideas, and suggestions are most welcome!

Jakob Gade
  • What is a large amount of data? 100 GB/hour? How fast must the replication be? Is it supposed to be some kind of scheduled bulk load, or more event-driven/per-record? – Albin Sunnanbo Aug 25 '10 at 06:34
  • Sorry, by "large" I meant "tens of thousands of records per day", which will require some processing, but hardly put much of a strain on our bandwidth. Bulk or event-driven also depends on the proposed solution. It can be a nightly job or just a continuous stream of single updated records/objects. – Jakob Gade Aug 25 '10 at 06:42

3 Answers


If you'd like a complete copy of their database, then log shipping is a good option; otherwise, check out replication, which gives you finer-grained control over what's copied.

You can copy across the data as-is and deal with the transformation logic on your side of the wire.
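For what it's worth, here's a minimal T-SQL sketch of what a transactional replication setup could look like on the partner's (publisher) side; the database, publication, and table names (PartnerDb, PartnerDataPub, Orders) are made up for illustration, and a distributor has to be configured beforehand:

    -- Minimal transactional replication sketch (publisher side).
    -- 'PartnerDb', 'PartnerDataPub' and 'Orders' are hypothetical names.
    USE PartnerDb;
    GO

    -- Mark the database as published.
    EXEC sp_replicationdboption
        @dbname  = N'PartnerDb',
        @optname = N'publish',
        @value   = N'true';

    -- Create a publication and add one table (article) to it.
    EXEC sp_addpublication
        @publication = N'PartnerDataPub',
        @status      = N'active';

    EXEC sp_addarticle
        @publication   = N'PartnerDataPub',
        @article       = N'Orders',
        @source_object = N'Orders';

The subscription on your side would then be created with sp_addsubscription, though in practice the whole thing is usually set up through the SSMS wizards rather than by hand.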

Will A
  • I think a more service-oriented architecture is what we're looking for: if we're going to maintain a complete copy of their database, it also means we'd have to make changes when they change their model. There has to be some kind of middle contract that defines the data and its structure. – Jakob Gade Aug 25 '10 at 06:46
  • Both log shipping and replication will apply schema changes - so that might not be a breaking issue. – Will A Aug 25 '10 at 06:55
  • Yeah, but then we'd have a database that "magically" changes overnight, possibly breaking our import/transformation code. :) – Jakob Gade Aug 25 '10 at 06:57
  • True - in which case you'll definitely need a contract between yourselves and the remote client. In this case, SSIS (as suggested by Vidar) is probably going to be a great choice for you in consuming the data, transforming it, and populating your local table(s). – Will A Aug 25 '10 at 07:01

In addition to Will's suggestions, you can take a look at SSIS (SQL Server Integration Services). With this, the partner can export to an FTP site and you can then import the data on your side.
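The SSIS package itself is built visually in BIDS, but just to illustrate the receiving half: a flat file dropped on the FTP site could be bulk-loaded into a staging table with something like the following (the file path and table name are made up):

    -- Hypothetical import of a file the partner dropped via FTP.
    BULK INSERT Staging.PartnerOrders
    FROM 'D:\ftp\partner\orders.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2,   -- skip the header row
        TABLOCK
    );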

Vidar Nordnes

Here is how I'd probably do it:

  1. Maintain a local copy of your partner's database. Call it staging, because that's where you'll transform the data. Use Data Compare (a VS 2010 feature, which you can of course automate) or Replication to achieve this. I'm not sure if Replication triggers CDC, but Data Compare sure does.
  2. Enable Change Data Capture on the staging database. This gives you only the changed records, so you can selectively transform the data and update your database (see the sketch after this list).
  3. Build and deploy an SSIS job that scans the CDC tables, transforms the data, and loads it into your database.
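To make steps 2 and 3 concrete, here's a rough T-SQL sketch. The database and table names (Staging, dbo.Orders) are made up; note that CDC generates the change function's name from the capture instance:

    -- Step 2: enable CDC on the staging database and on one table.
    USE Staging;
    GO
    EXEC sys.sp_cdc_enable_db;
    GO
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Orders',
        @role_name     = NULL;   -- no gating role
    GO

    -- Step 3 (inside the SSIS job): read all changes in an LSN range.
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_Orders');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');

In a real job you'd persist the last LSN you processed and use it as the lower bound on the next run, instead of always starting from the minimum.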

As far as I know, Data Compare works well when the schema changes (it simply ignores the schema changes). CDC also works well: all you need to do is create a second capture instance when there is a schema change and remove the first one.
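A sketch of that switch-over, again with made-up names (the @capture_instance values are whatever you choose; CDC allows two capture instances per table, which is what makes this possible):

    -- After a schema change: add a second capture instance for the
    -- new table shape...
    EXEC sys.sp_cdc_enable_table
        @source_schema    = N'dbo',
        @source_name      = N'Orders',
        @role_name        = NULL,
        @capture_instance = N'dbo_Orders_v2';

    -- ...and drop the old instance once the SSIS job reads from the new one.
    EXEC sys.sp_cdc_disable_table
        @source_schema    = N'dbo',
        @source_name      = N'Orders',
        @capture_instance = N'dbo_Orders';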

Josh