
A friend's company is working on a data architecture that, to us, seems rather convoluted and has several scalability and cost problems.

If possible, I would like your opinion on the old and proposed architectures (or alternatives), to discuss their pros and cons and potentially uncover unforeseen problems/limitations.

Current Architecture - Azure Stack

Ingestion Layer

  • Multiple sources are ingested into Azure Data Lake Gen2 via Azure Databricks
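
A minimal PySpark sketch of this step, run from a Databricks notebook; the JDBC source, credentials, and storage paths are all hypothetical placeholders:

```python
# Ingestion sketch: pull a source table over JDBC and land it unmodified
# in ADLS Gen2. All connection details below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://source-srv.database.windows.net:1433;database=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "...")  # in practice, read from a Databricks secret scope
    .load()
)

# abfss:// is the ADLS Gen2 driver; container and account names are placeholders.
source_df.write.mode("append").parquet(
    "abfss://datalake@mystorage.dfs.core.windows.net/raw/sales/orders/"
)
```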

Processing Layer

  • Azure Databricks cleans the data and stores it back in Azure Data Lake Gen2 under different zones: raw and clean
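
A hedged sketch of the raw-to-clean hop; the column names and cleaning rules are illustrative assumptions, not the company's actual logic:

```python
# Processing sketch: read the raw zone, apply basic cleaning, write the clean zone.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
base = "abfss://datalake@mystorage.dfs.core.windows.net"

raw_df = spark.read.parquet(f"{base}/raw/sales/orders/")

clean_df = (
    raw_df
    .dropDuplicates(["order_id"])                      # drop replayed records
    .filter(F.col("order_date").isNotNull())           # basic validity check
    .withColumn("cleaned_at", F.current_timestamp())   # lineage column
)

clean_df.write.mode("overwrite").parquet(f"{base}/clean/sales/orders/")
```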

Loading Layer

  • Azure Databricks is used to load the data into an Azure SQL Server instance (see the sketch after this list)
  • Azure Synapse is used as the layer between Azure SQL Server and Azure Analysis Services
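
A sketch of the load step using Spark's built-in JDBC writer; the server, database, and table names are placeholders:

```python
# Loading sketch: push the clean zone into Azure SQL Server over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

clean_df = spark.read.parquet(
    "abfss://datalake@mystorage.dfs.core.windows.net/clean/sales/orders/"
)

(
    clean_df.write.format("jdbc")
    .option("url", "jdbc:sqlserver://dwh-srv.database.windows.net:1433;database=dwh")
    .option("dbtable", "dbo.fact_orders")
    .option("user", "loader")
    .option("password", "...")  # in practice, a secret scope
    .mode("overwrite")
    .save()
)
```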

Presentation Layer

  • Data models are created in Azure Analysis Services and served to Power BI or Excel

Pros of this Approach

  • OLAP models and optimisations are generated via Azure Analysis Services
  • Fully integrated within the Azure ecosystem

Cons of this Approach

  • Vertically scalable only: as the data grows, the Azure SQL Server instance used to store the Data Warehouse must be scaled up, as must the Azure Analysis Services tier required to process that data
  • High Costs
  • Convoluted: a lot of services serving as glue, harder to manage/maintain
  • Always-on approach: both Azure SQL Server and Azure Analysis Services must be always on, which represents unnecessary expense

Proposed Architecture - Azure w/ Delta Lake

The alternative architecture builds on the fact that Azure Databricks is already used within the ETL process, and maximizes its usage to gain horizontal scalability and serverless resources.

Ingestion Layer

  • (same) Multiple sources are ingested into Azure Data Lake Gen2 via Azure Databricks

Processing and Loading Layers

  • (same) Azure Databricks cleans the data and stores it back in Azure Data Lake Gen2 under different zones: raw and clean
  • Azure Databricks generates the data warehouse using Delta Lake, storing it directly in Azure Data Lake Gen2 and creating Silver and Gold (aggregations/cubes) quality tiers (see the sketch after this list)
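
A sketch of how the Silver and Gold tiers could be built with Delta Lake; the table layout and the aggregation are assumptions for illustration:

```python
# Warehouse sketch: Silver = conformed Delta tables, Gold = pre-aggregated marts.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
base = "abfss://datalake@mystorage.dfs.core.windows.net"

# Silver: promote the clean zone to a Delta table (ACID, time travel, audit log).
clean_df = spark.read.parquet(f"{base}/clean/sales/orders/")
clean_df.write.format("delta").mode("overwrite").save(f"{base}/silver/sales/orders/")

# Gold: a cube-like daily aggregate that reporting tools can hit directly.
gold_df = (
    spark.read.format("delta").load(f"{base}/silver/sales/orders/")
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("total_amount"),
         F.count("*").alias("order_count"))
)
gold_df.write.format("delta").mode("overwrite").save(f"{base}/gold/sales/daily_sales/")
```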

Presentation Layer

  • Azure Synapse Analytics (Serverless) provides access control and querying capabilities directly on Azure Data Lake Gen2, which is then exposed to Power BI and Excel for governance purposes
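
A sketch of querying the Gold tier through the Synapse serverless SQL endpoint from Python; the endpoint, credentials, and paths are placeholders:

```python
# Presentation sketch: serverless SQL reads the lake files in place via OPENROWSET,
# so nothing has to stay "always on". Connection details are illustrative.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=report_user;PWD=..."
)

query = """
SELECT TOP 10 order_date, region, total_amount
FROM OPENROWSET(
    BULK 'https://mystorage.dfs.core.windows.net/datalake/gold/sales/daily_sales/',
    FORMAT = 'DELTA'
) AS daily_sales
ORDER BY total_amount DESC;
"""

for row in conn.cursor().execute(query):
    print(row.order_date, row.region, row.total_amount)
```

Power BI and Excel can connect to that same serverless endpoint as if it were a regular SQL Server database, which is where the access-control/governance story would live.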

Pros of this Approach

  • Simpler
  • Horizontally scalable: the pipeline relies on Delta Lake over Azure Data Lake Gen2 (Parquet), which is naturally horizontally scalable, and on Azure Synapse Analytics (Serverless), which, thanks to its serverless pool, can also be considered roughly horizontally scalable.
  • Delta Lake provides a natural Audit layer and is easily integrated with existing Data Catalog solutions

Cons of this Approach

  • No data models to facilitate usage via Power BI and Excel
  • It is critical to partition the Parquet files according to Azure Synapse Analytics's query patterns, or we may incur high scan/transfer costs (see the sketch after this list)
  • No true OLAP capabilities
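
A sketch of the partitioning concern from the cons list: writing the Gold table partitioned by the most common filter column so serverless queries prune files instead of scanning everything. The partition column is an assumption:

```python
# Partitioning sketch: one folder per order_date value means a query filtering
# on order_date only reads the matching folders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base = "abfss://datalake@mystorage.dfs.core.windows.net"

gold_df = spark.read.format("delta").load(f"{base}/gold/sales/daily_sales/")

(
    gold_df.write.format("delta")
    .partitionBy("order_date")   # align with the dominant query predicate
    .mode("overwrite")
    .save(f"{base}/gold/sales/daily_sales_partitioned/")
)
```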

1 Answer


I think you have covered almost everything. Based on my experience, here are a few suggestions. You can consider this approach if your business model allows.


Ingestion Layer:

  • Each team (business unit) should have a different container for storing data. Reason: access can be maintained at the team level.
  • Since in any distributed environment an ELT process is recommended over ETL, we can use Azure Data Factory as the ingestion tool to build the data lake. Reason: Databricks is meant for compute, so there is no point in using it for ingestion.
  • Each container should hold three different folders.
  • The first layer, stage, holds incremental loads from the source on a daily basis (or at whatever frequency the source dictates).
  • Each time, the stage data is appended into the raw layer, so the raw layer ends up containing an exact snapshot of the source (see the sketch after this list).
  • We should also maintain a curated folder: sometimes we need to handle secure data, and there we can isolate it from the rest.
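
A sketch of the stage-to-raw promotion described above; container names, paths, and the load date are placeholders:

```python
# Stage -> raw sketch: append each incremental drop so raw accumulates into an
# exact snapshot of the source. One container per team, as suggested above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
team = "abfss://team-sales@mystorage.dfs.core.windows.net"

# Today's incremental load, landed in stage by Azure Data Factory.
stage_df = spark.read.parquet(f"{team}/stage/orders/2024-01-15/")

(
    stage_df
    .withColumn("load_date", F.lit("2024-01-15"))  # keep track of each append
    .write.mode("append")
    .parquet(f"{team}/raw/orders/")
)
```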

Structure Layer:

  • In this layer, we maintain the proper structure of the data. For example, we may need to convert from one format to another: a column may be a string in the source, but the business needs it as a decimal. That kind of processing should be handled in this layer.

  • We can handle this transformation through Azure Databricks.
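
A sketch of the string-to-decimal conversion mentioned above; column and path names are illustrative:

```python
# Structure-layer sketch: enforce the business type (string -> decimal).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.getOrCreate()
team = "abfss://team-sales@mystorage.dfs.core.windows.net"

raw_df = spark.read.parquet(f"{team}/raw/orders/")

# Cast the source string column to decimal(18, 2); unparseable values become null.
structured_df = raw_df.withColumn(
    "amount", F.col("amount").cast(DecimalType(18, 2))
)

structured_df.write.mode("overwrite").parquet(f"{team}/structure/orders/")
```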

Serve Layer:

  • In this layer we do all the transformations needed for reporting. For example, Team-1 and Team-2 data would be joined here.

  • We can handle this transformation through Azure Databricks as well.
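
A sketch of the serve-layer join between two teams' structured data; the datasets and the join key are assumptions:

```python
# Serve-layer sketch: combine two business units' data into one reporting dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
store = "mystorage.dfs.core.windows.net"

orders = spark.read.parquet(f"abfss://team-sales@{store}/structure/orders/")
customers = spark.read.parquet(f"abfss://team-crm@{store}/structure/customers/")

# Left join keeps all orders even when the CRM side has no matching customer.
serve_df = orders.join(customers, on="customer_id", how="left")

serve_df.write.mode("overwrite").parquet(f"abfss://serve@{store}/sales_by_customer/")
```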

Presentation Layer:

  • Use Azure Synapse Analytics (Serverless) to provide access control and querying capabilities directly on Azure Data Lake Gen2, which is then exposed to Power BI and Excel for governance purposes.

  • Alternatively, we can connect to the serve layer through a Databricks cluster JDBC connection. That way, handling all access control becomes very easy if the reporting layer connects through Databricks.