I am part of a small data project with four data ingestion pipelines that have been running for about six months and have gathered close to 90 MB of data. There is scope for the data volume to grow, but it will not exceed 1 TB at any point.
I was asked to design a reliable infrastructure to ingest, transform and push the data to Power BI. I proposed the following (a rough sketch of the flow is below the list):
Orchestration engine - ADF
Storage - Data Lake Storage Gen2
Transformation - Databricks (DBX)
Model/view - SQL
Front-end - Power BI
Medallion architecture to store the data
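For context, the Databricks step I had in mind is genuinely small. A simplified sketch of what the notebook would do, promoting data through the medallion layers (the storage account, paths and column names are placeholders, not the real project code):

```python
# Simplified sketch of the proposed Databricks (DBX) job.
# ADF copies raw files into the landing zone; this job promotes them
# landing -> staging (Parquet) -> curated, which SQL / Power BI reads.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks the session already exists; getOrCreate() just returns it.
spark = SparkSession.builder.getOrCreate()

lake = "abfss://lake@<storage-account>.dfs.core.windows.net"  # placeholder

# Landing -> staging: keep the raw data untouched, rewrite as Parquet
# so downstream reads are cheap even if volume grows.
raw = spark.read.option("header", True).csv(f"{lake}/landing/sales/")
raw.write.mode("overwrite").parquet(f"{lake}/staging/sales/")

# Staging -> curated: apply the (admittedly simple) business logic.
staged = spark.read.parquet(f"{lake}/staging/sales/")
curated = (
    staged
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("amount").isNotNull())
)
curated.write.mode("overwrite").parquet(f"{lake}/curated/sales/")
```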
I recognize the need to cut costs wherever possible, so I recommended the minimum tier for each resource in the architecture. Despite my best efforts, the client's tech lead is not on board with the proposal.
He thinks Databricks is overkill for the project's data volume and prefers to use stored procedures.
He also thinks he only needs the curated layer, not landing and staging. I explained the benefits of the medallion structure, but he still doesn't want the extra layers; his reasoning is that the data is very simple and so are the transformations.
So my questions are:
- Given that I have low-volume data on Azure, should I consider using Databricks? The main constraint is cost.
- If Databricks is not recommended for low volume data on Azure, what are the alternative solutions?
I checked other Azure services like Function Apps and ADF, but cost-wise they seem to end up in the same place as DBX.
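For reference, the Function App variant I evaluated would look roughly like this: a timer-triggered Python function doing the same transform with pandas, since 90 MB fits in memory easily. The connection string, containers, file and column names here are made up for illustration:

```python
import io

import azure.functions as func
import pandas as pd
from azure.storage.blob import BlobServiceClient

app = func.FunctionApp()


@app.timer_trigger(schedule="0 0 2 * * *", arg_name="timer")  # nightly at 02:00
def transform(timer: func.TimerRequest) -> None:
    # Hypothetical names: adjust to your storage account / containers.
    service = BlobServiceClient.from_connection_string("<connection-string>")

    # Read the raw CSV from the landing container.
    landing = service.get_blob_client("landing", "sales/latest.csv")
    df = pd.read_csv(io.BytesIO(landing.download_blob().readall()))

    # The same simple business logic as the Databricks sketch.
    df["order_date"] = pd.to_datetime(df["order_date"])
    df = df[df["amount"].notna()]

    # Write the curated output as Parquet (requires pyarrow) for SQL / Power BI.
    out = io.BytesIO()
    df.to_parquet(out, index=False)
    curated = service.get_blob_client("curated", "sales/latest.parquet")
    curated.upload_blob(out.getvalue(), overwrite=True)
```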
- If I am not going to reuse the data, why should I implement a medallion architecture? (I am against leaving it out of the project, but I am open to suggestions.)
Note: I am not against the medallion model. I know we keep the landing data for future use cases, staging holds Parquet files for easy processing of larger volumes, and the curated layer holds the business logic. Please point out anything else I have missed here.