5

I have a requirement to build a data warehouse using Azure SQL Data Warehouse, sourced with data from a single Azure SQL Database. Looking at this article, it seems I have two options:

  • SSIS
  • Azure Data Factory

My Azure SQL Database is normalized, but I'd like to build my Azure SQL Data Warehouse in a star or snowflake schema format, so it's not a straight copy of tables - there definitely needs to be some transformation happening.

Given that SSIS isn't supported on Azure SQL Database and would require me to run a VM with SQL Server on it to keep my processes entirely in Azure, is Azure Data Factory the recommended tool to ETL data between Azure SQL Database and Azure SQL Data Warehouse? Would one choice be more fully featured than the other in this situation?

Thank you!

tarun713

2 Answers

4

If you are looking to keep your processes entirely in the cloud and not spin up a VM, then Azure Data Factory is currently your best option for data movement. However, at this point in time I might recommend an ELT approach over an ETL approach. By loading your data first and then completing your transformations within the SQL Data Warehouse, your loads will be quicker and you will be able to take advantage of additional compute power when transforming your data. Additionally, it sounds like you are familiar with SQL, so it may be easier for you to write your transformations inside of SQL Data Warehouse as opposed to using Data Factory's JSON.
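To make the ELT idea concrete, here is a minimal sketch, assuming Data Factory has already copied the normalized source tables into a staging schema in SQL Data Warehouse (the stg tables, columns, and the DimCustomer name below are hypothetical). The reshaping into a star-schema dimension happens inside the warehouse with a CTAS statement, so it runs on the warehouse's compute rather than on the source database:

--Hypothetical example: flatten staged, normalized tables into a dimension with CTAS
CREATE TABLE dbo.DimCustomer
WITH ( DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX )
AS
SELECT
    c.CustomerId   AS CustomerKey,
    c.CustomerName,
    r.RegionName
FROM stg.Customer AS c
JOIN stg.Region   AS r
    ON r.RegionId = c.RegionId;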

  • Thank you for the response! A minor follow up - what tool would be best to perform the transformations inside SQL Data Warehouse? We're looking to capture historical data via snapshot-style tables that we'd insert to but never update (i.e., on this day you were connected to Y others), as well as dimension data that would be inserted and updated. Would this be some kind of process that can be triggered by the completion of an ADF activity and possibly execute some sprocs on the ADW to populate the star/snowflake schema tables? – tarun713 Oct 28 '15 at 13:57
  • Data Factory will allow you to schedule your transformations in pipelines as well with the [SQLServerStoredProcedure](https://azure.microsoft.com/en-us/documentation/articles/data-factory-stored-proc-activity/) activity. This should allow you to schedule your data movement and sprocs as one without the need for triggers. – Drew DiPalma - MSFT Nov 04 '15 at 17:27
  • @tarun713 Yes, precisely. ADF activities allow a "clean up" stored procedure to be run upon completion of an activity; this would be some master stored procedure which would do all the work to populate your star schema (a rough sketch of such a procedure follows these comments). – Kyle Hale Aug 04 '17 at 17:55
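A minimal sketch of what that master stored procedure might look like, reusing the same hypothetical staging tables and dimension as the example above; Data Factory's stored procedure activity would call it after the copy activity completes. The snapshot fact table is insert-only, along the lines described in the comments:

--Hypothetical master procedure for Data Factory to call after the copy finishes
CREATE PROCEDURE dbo.usp_LoadStarSchema
AS
BEGIN
    --Rebuild the dimension from the freshly staged data
    IF OBJECT_ID('dbo.DimCustomer') IS NOT NULL
        DROP TABLE dbo.DimCustomer;

    CREATE TABLE dbo.DimCustomer
    WITH ( DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX )
    AS
    SELECT CustomerId AS CustomerKey, CustomerName
    FROM stg.Customer;

    --Append today's rows to an insert-only, snapshot-style fact table
    INSERT INTO dbo.FactConnectionSnapshot (SnapshotDate, CustomerKey, ConnectionCount)
    SELECT CAST(GETDATE() AS DATE), CustomerId, ConnectionCount
    FROM stg.CustomerConnections;
END;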
0

Look into whether creating an external table (in the DW) that points at your Azure SQL Database will help you out. With that, you might get away with issuing a few Create Table As (CTAS) commands to change the structure to stars/snowflakes.

From https://msdn.microsoft.com/en-us/library/dn935022.aspx

--Create an external data source for a remote database on Azure SQL Database
CREATE EXTERNAL DATA SOURCE data_source_name
    WITH ( 
        TYPE = RDBMS,
        LOCATION = '<server_name>.database.windows.net',
        DATABASE_NAME = 'Remote_Database_Name',
        CREDENTIAL = SQL_Credential
    )
;
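Continuing that idea as a minimal sketch, and assuming this elastic-query style external data source actually works against your SQL Data Warehouse (a comment below questions whether it does), you could expose a remote table and then materialise it locally with CTAS. The table, column, and credential names here are made up for illustration:

--Hypothetical: expose a table from the remote Azure SQL Database
CREATE EXTERNAL TABLE dbo.ext_Customer (
    CustomerId   INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL,
    Region       NVARCHAR(50)  NOT NULL
)
WITH (
    DATA_SOURCE = data_source_name,
    SCHEMA_NAME = 'dbo',
    OBJECT_NAME = 'Customer'
);

--Hypothetical: pull the rows across once and reshape them locally with CTAS
CREATE TABLE dbo.DimCustomer
WITH ( DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX )
AS
SELECT CustomerId AS CustomerKey, CustomerName, Region
FROM dbo.ext_Customer;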
fchukoskie
  • This article fleshes out how that works. https://azure.microsoft.com/en-us/documentation/articles/sql-database-elastic-query-getting-started-vertical/ – fchukoskie Oct 28 '15 at 13:45
  • Thanks for the response! This is very interesting and I was not aware of it; however, one of our goals is to build out the data warehouse db so that any analytics we want to run avoid putting load on our relational database. While implementing this might save us having to copy over data in some dimension tables, the load would still remain on our relational database - so I don't think this is an option for us. – tarun713 Oct 28 '15 at 13:56
  • There's going to be load on the relational db whenever you move the data, no matter how you move it. I think this idea is just the difference between a data push (using SSIS or Data Factory) and a data pull (querying the external table and saving the result with a CTAS command). Once that new table is loaded, your load on the relational DB ends, and you do your remaining work on the new table that you loaded. So the external table is just an interim step used during the load...not the final table that you'd query. – fchukoskie Oct 28 '15 at 14:01
  • This might be the missing piece that helps clarify what I mean. https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-develop-ctas/ – fchukoskie Oct 28 '15 at 14:04
  • Admittedly, I don't yet have a good handle on how one would schedule & trigger loads using this approach. – fchukoskie Oct 28 '15 at 14:15
  • Got it - thank you for that link, I understand this much better now. I see what you are saying and yes, this might be helpful to us - but you already answered what would be my next question - how to schedule or trigger this. Seems like we just need some way to schedule a job within Azure SQL Data Warehouse to make this approach or Drew's suggestion work end to end. – tarun713 Oct 28 '15 at 14:21
  • I think that the scheduling would use Drew's approach in that the Data Factory could kick off stored procedures. I'm not far enough along in my own design to have more details for you, I'm afraid. Good luck! https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-integrate-azure-data-factory/#stored-procedures – fchukoskie Oct 28 '15 at 14:33
  • @fchukoskie just double checking my understanding that this type of external data source inside Azure SQL DW does not currently work. Have you made it work? – GregGalloway Mar 12 '16 at 13:05