
I'm trying to build a real-time reporting service on top of Microsoft Azure SQL Data Warehouse. Currently I have a SQL Server instance with about 5 TB of data. I want to stream the data into the data warehouse and use Azure DW's compute power to generate real-time reports from it. Are there any ready-to-use solutions or best practices for doing that?

One approach I was considering is to load the data into Kafka and then stream it into Azure DW via Spark Streaming. However, that approach is more near-real-time than real-time. Is there any way to use SQL Server Change Data Capture (CDC) to stream the changes into the data warehouse?
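
For reference, this is roughly what enabling CDC on the source side looks like in T-SQL (a minimal sketch; dbo.Orders is just a placeholder table name):

    -- Enable CDC at the database level (requires SQL Server Agent)
    EXEC sys.sp_cdc_enable_db;

    -- Track changes for one source table; SQL Server then writes
    -- inserts/updates/deletes into a change table that a downstream
    -- process (Kafka Connect, a custom reader, etc.) can poll
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Orders',
        @role_name     = NULL,
        @supports_net_changes = 1;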


2 Answers


I don't personally see Azure SQL Data Warehouse in a real-time architecture. It's a batch MPP system optimised for shredding billions of rows across multiple nodes, and that pattern is not synonymous with sub-second or real-time performance, in my humble opinion. Real-time architectures in Azure tend to look more like Event Hubs > Stream Analytics. The low concurrency available (i.e. currently a maximum of 32 concurrent queries) is also not a good fit for reporting.
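
To give a flavour of that pattern, a Stream Analytics job sitting between an Event Hub input and a reporting output is just a SQL-like query over time windows; the input/output aliases and columns below are placeholders:

    -- Count events per device over 10-second tumbling windows
    SELECT
        DeviceId,
        COUNT(*)         AS EventCount,
        System.Timestamp AS WindowEnd
    INTO
        [reporting-output]
    FROM
        [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
    GROUP BY
        DeviceId,
        TumblingWindow(second, 10)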

As an alternative, you might consider Azure SQL Database memory-optimized (in-memory) tables for fast loading, then hand the data off to the warehouse at a convenient point.
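
A memory-optimized staging table in Azure SQL Database (Premium tier) would look something like the sketch below; table and column names are made up:

    CREATE TABLE dbo.EventsStaging
    (
        EventId     BIGINT IDENTITY NOT NULL PRIMARY KEY NONCLUSTERED,
        Payload     NVARCHAR(4000),
        LoadedAtUtc DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

You would land hot data here for fast ingest and low-latency queries, then move it into the warehouse in batches (e.g. with Azure Data Factory) on whatever schedule suits you.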

You could use Azure SQL Data Warehouse in a so-called Lambda architecture with a batch and a real-time element, where it supports the batch layer. See here for further reading:

https://social.technet.microsoft.com/wiki/contents/articles/33626.lambda-architecture-implementation-using-microsoft-azure.aspx

wBob
  • The idea of using Data Warehouse with SQL Database in-memory tables seems interesting. I'll go and implement a lambda architecture with Data Warehouse and in-memory tables. – taffarel Oct 25 '17 at 06:39

If you're looking for a SQL-based SaaS solution to power real-time reporting applications, we recently released an HTTP API product called Stride, which is based on PipelineDB, the open-source streaming-SQL database we built, and which can handle this type of workload.

The Stride API enables developers to run continuous SQL queries on streaming data and store the results of those queries in tables that are incrementally updated as new data arrives. This might be a simpler approach to building the type of real-time analytics layer you mentioned above.
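
Conceptually, the PipelineDB layer underneath looks something like this (a rough sketch; stream and column names are made up, and Stride itself is accessed over HTTP rather than raw SQL):

    -- Declare a stream and a continuous view over it; the view's backing
    -- table is updated incrementally as events arrive on the stream
    CREATE STREAM page_views (user_id INT, url TEXT, ts TIMESTAMPTZ);

    CREATE CONTINUOUS VIEW views_per_minute AS
        SELECT date_trunc('minute', ts) AS minute,
               url,
               COUNT(*) AS views
        FROM page_views
        GROUP BY minute, url;

    -- Writes to the stream flow into the view...
    INSERT INTO page_views (user_id, url, ts) VALUES (42, '/home', now());

    -- ...and reports read from the view like any ordinary table
    SELECT * FROM views_per_minute ORDER BY minute DESC;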

Feel free to check out the Stride technical docs for more detail.