
I'm trying to build a real-time reporting service on top of Microsoft Azure SQL Data Warehouse. Currently I have a SQL Server instance with about 5 TB of data. I want to stream the data into the data warehouse and use Azure DW's compute power to generate real-time reports from it. Are there any ready-to-use solutions or best practices for doing that?

One approach I was considering is to load the data into Kafka and then stream it into Azure DW via Spark Streaming. However, that approach is more near-real-time than real-time. Is there any way to use SQL Server Change Data Capture (CDC) to stream the changes into the data warehouse?
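
For reference, this is roughly what enabling CDC on the source side looks like in T-SQL (a minimal sketch; dbo.Orders is just a placeholder table name):

    -- Enable CDC at the database level (requires SQL Server Agent)
    EXEC sys.sp_cdc_enable_db;

    -- Track changes for one source table; SQL Server then writes
    -- inserts/updates/deletes into a change table that a downstream
    -- process (Kafka Connect, a custom reader, etc.) can poll
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Orders',
        @role_name     = NULL,
        @supports_net_changes = 1;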


2 Answers


I don't personally see Azure SQL Data Warehouse in a real-time architecture. It's a batch MPP system optimised for shredding billions of rows across multiple nodes, and that pattern is not synonymous with sub-second or real-time performance, in my humble opinion. Real-time architectures in Azure tend to look more like Event Hubs > Stream Analytics. The low concurrency available (i.e. currently a maximum of 32 concurrent queries) is also not a good fit for reporting.
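
To give a flavour of that pattern, a Stream Analytics job sitting between an Event Hub input and a reporting output is just a SQL-like query over time windows; the input/output aliases and columns below are placeholders:

    -- Count events per device over 10-second tumbling windows
    SELECT
        DeviceId,
        COUNT(*)         AS EventCount,
        System.Timestamp AS WindowEnd
    INTO
        [reporting-output]
    FROM
        [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
    GROUP BY
        DeviceId,
        TumblingWindow(second, 10)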

As an alternative, you might consider Azure SQL Database memory-optimized (in-memory) tables for fast loading, then hand the data off to the warehouse at a convenient point.
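
A memory-optimized staging table in Azure SQL Database (Premium tier) would look something like the sketch below; table and column names are made up:

    CREATE TABLE dbo.EventsStaging
    (
        EventId     BIGINT IDENTITY NOT NULL PRIMARY KEY NONCLUSTERED,
        Payload     NVARCHAR(4000),
        LoadedAtUtc DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

You would land hot data here for fast ingest and low-latency queries, then move it into the warehouse in batches (e.g. with Azure Data Factory) on whatever schedule suits you.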

You could use Azure SQL Data Warehouse in a so-called Lambda architecture with a batch and a real-time element, where it supports the batch layer. See here for further reading:

https://social.technet.microsoft.com/wiki/contents/articles/33626.lambda-architecture-implementation-using-microsoft-azure.aspx

wBob
  • The idea of using Data Warehouse with SQL Database in-memory tables seems interesting. I'll go and implement a lambda architecture with Data Warehouse and in-memory tables. – taffarel Oct 25 '17 at 06:39

If you're looking for a SQL-based SaaS solution to power real-time reporting applications, we recently released an HTTP API product called Stride, which is based on PipelineDB, the open-source streaming-SQL database we built, and which can handle this type of workload.

The Stride API enables developers to run continuous SQL queries on streaming data and store the results of those queries in tables that are incrementally updated as new data arrives. This might be a simpler approach to building the type of real-time analytics layer you mentioned above.
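
Conceptually, the PipelineDB layer underneath looks something like this (a rough sketch; stream and column names are made up, and Stride itself is accessed over HTTP rather than raw SQL):

    -- Declare a stream and a continuous view over it; the view's backing
    -- table is updated incrementally as events arrive on the stream
    CREATE STREAM page_views (user_id INT, url TEXT, ts TIMESTAMPTZ);

    CREATE CONTINUOUS VIEW views_per_minute AS
        SELECT date_trunc('minute', ts) AS minute,
               url,
               COUNT(*) AS views
        FROM page_views
        GROUP BY minute, url;

    -- Writes to the stream flow into the view...
    INSERT INTO page_views (user_id, url, ts) VALUES (42, '/home', now());

    -- ...and reports read from the view like any ordinary table
    SELECT * FROM views_per_minute ORDER BY minute DESC;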

Feel free to check out the Stride technical docs for more detail.