A friend's company is working on a data architecture which, to us, seems rather convoluted and has several scalability and cost problems.
If possible, I would like to have your opinion on the old and proposed architectures (or alternatives), to discuss their pros and cons and potentially find unforeseen problems/limitations.
Current Architecture - Azure Stack
Ingestion Layer
- Data from multiple sources is stored in Azure Data Lake Gen2 via Azure Databricks
Processing Layer
- Azure Databricks cleans the data and stores it back in Azure Data Lake Gen2, split into different sections: raw and clean
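To make the first two layers concrete, the raw-to-clean hop on Databricks looks roughly like the following PySpark sketch (container names, dataset, and columns are made up; `spark` is the session the Databricks runtime provides):

```python
from pyspark.sql import functions as F

# Hypothetical lake layout: a "raw" and a "clean" container in the same account
raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/"
clean_path = "abfss://clean@mydatalake.dfs.core.windows.net/sales/"

# Read the raw landing data (format depends on the source; JSON assumed here)
df = spark.read.json(raw_path)

# Typical cleaning steps: dedupe, type casts, null filtering
cleaned = (
    df.dropDuplicates(["order_id"])                     # hypothetical business key
      .withColumn("order_date", F.to_date("order_date"))
      .filter(F.col("amount").isNotNull())
)

cleaned.write.mode("overwrite").parquet(clean_path)
```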
Loading Layer
- Azure Databricks is used to load the data into an Azure SQL Server instance
- Azure Synapse is used as the layer between Azure SQL Server and Azure Analysis Services
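The Databricks-to-SQL hop is a plain Spark JDBC write; a sketch with placeholder server, table, and secret-scope names (`spark` and `dbutils` come from the Databricks runtime):

```python
clean_path = "abfss://clean@mydatalake.dfs.core.windows.net/sales/"  # hypothetical

jdbc_url = (
    "jdbc:sqlserver://myserver.database.windows.net:1433;"
    "database=mydw;encrypt=true;"
)

(spark.read.parquet(clean_path)
    .write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.FactSales")                         # hypothetical table
    .option("user", dbutils.secrets.get("kv", "sql-user"))      # secret scope "kv"
    .option("password", dbutils.secrets.get("kv", "sql-pass"))
    .mode("append")
    .save())
```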
Presentation Layer
- Data models are created in Azure Analysis Services and served to Power BI or Excel
Pros of this Approach
- OLAP models and optimisations are generated via Azure Analysis Services
- Fully integrated within the Azure ecosystem
Cons of this Approach
- Vertically scalable: as the data grows, the Azure SQL Server instance used to store the Data Warehouse has to scale up, as does the Azure Analysis Services instance required to process said data
- High Costs
- Convoluted: a lot of services serving as glue, harder to manage/maintain
- Always-on approach: both Azure SQL Server and Azure Analysis Services must be always on, which represents unneeded expense
Proposed Architecture - Azure w/ Delta Lake
The alternative architecture builds on the fact that Azure Databricks is already used within the ETL process, and attempts to maximize its usage to gain horizontal scalability and serverless resources.
Ingestion Layer
- (same) Data from multiple sources is stored in Azure Data Lake Gen2 via Azure Databricks
Processing and Loading Layers
- (same) Azure Databricks cleans the data and stores it back in Azure Data Lake Gen2, split into different sections: raw and clean
- Azure Databricks generates the Data Warehouse using Delta Lake, storing it directly in Azure Data Lake Gen2 and creating Silver and Gold (aggregations/cubes) quality tiers
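The Silver-to-Gold step essentially replaces the cubes with pre-aggregated Delta tables; a minimal sketch with made-up paths and dimensions:

```python
from pyspark.sql import functions as F

# Hypothetical paths for the Silver and Gold tiers
silver_path = "abfss://lake@mydatalake.dfs.core.windows.net/silver/sales"
gold_path = "abfss://lake@mydatalake.dfs.core.windows.net/gold/sales_daily"

silver = spark.read.format("delta").load(silver_path)

# Gold = the aggregations the OLAP cubes used to serve
daily = (
    silver.groupBy("order_date", "region")
          .agg(F.sum("amount").alias("total_amount"),
               F.countDistinct("order_id").alias("orders"))
)

daily.write.format("delta").mode("overwrite").save(gold_path)
```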
Presentation Layer
- Azure Synapse Analytics (Serverless) is used to grant access and provide querying capabilities directly on Azure Data Lake Gen2, which is then exposed to Power BI and Excel for governance purposes
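In case it helps the discussion, this is roughly how a client (or Power BI via the same endpoint) would hit the Gold layer through the Synapse Serverless SQL endpoint; a sketch only, assuming an ad-hoc OPENROWSET query over Delta rather than pre-created external tables, with placeholder workspace/database names:

```python
import pyodbc

# Hypothetical Synapse Serverless endpoint and database
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=lakehouse;Authentication=ActiveDirectoryInteractive;"
)

# Serverless reads the Delta files in place; it never loads them into a DB
query = """
SELECT order_date, region, SUM(total_amount) AS total_amount
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/lake/gold/sales_daily/',
    FORMAT = 'DELTA'
) AS gold
GROUP BY order_date, region;
"""

for row in conn.execute(query):
    print(row)
```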
Pros of this Approach
- Simpler
- Horizontally scalable: the pipeline relies on Delta Lake (Parquet) stored in Azure Data Lake Gen2, which is naturally horizontally scalable, and on Azure Synapse Analytics (Serverless), which, thanks to its serverless pool, can also be considered horizontally scalable-ish.
- Delta Lake provides a natural Audit layer and is easily integrated with existing Data Catalog solutions
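On the audit point: the Delta transaction log gives table history and time travel with no extra infrastructure, e.g. (path hypothetical):

```python
gold_path = "abfss://lake@mydatalake.dfs.core.windows.net/gold/sales_daily"

# Who/what changed the table, when, and with which operation
spark.sql(f"DESCRIBE HISTORY delta.`{gold_path}`").show(truncate=False)

# Reproduce a report against an earlier state of the table
snapshot = (spark.read.format("delta")
                 .option("versionAsOf", 3)   # or .option("timestampAsOf", "2023-01-01")
                 .load(gold_path))
```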
Cons of this Approach
- No Data Models to facilitate usage via Power BI and Excel
- It is critical to partition the Parquet files according to Azure Synapse Analytics's query patterns, or we may incur high scan/transfer costs (see the sketch after this list)
- No true OLAP capabilities
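Regarding the partitioning con: since Synapse Serverless bills per data scanned, the Gold tables should be written partitioned on the columns its queries filter by, so only the matching folders are read. A sketch, with hypothetical paths and a made-up partition column:

```python
# Rewrite an unpartitioned Gold table partitioned by the most common filter
# column, so Synapse Serverless prunes folders instead of scanning everything
src = "abfss://lake@mydatalake.dfs.core.windows.net/gold/sales_daily"        # hypothetical
dst = "abfss://lake@mydatalake.dfs.core.windows.net/gold/sales_daily_part"   # hypothetical

(spark.read.format("delta").load(src)
    .write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")   # match the most common WHERE clause in Synapse
    .save(dst))
```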