I am part of a small data project with four data ingestion pipelines that have been running for about six months and have gathered close to 90 MB of data. There is scope for the data volume to grow, but it will not exceed 1 TB at any point.
I was asked to design a reliable infrastructure to ingest, transform and push the data to Power BI. I proposed the following (a rough sketch of the flow is below the list):
Orchestration engine - ADF
Storage - Data Lake Storage Gen2
Transformation - Databricks (DBX)
Model/view - SQL
Front-end - Power BI
Medallion architecture to store the data
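For context, the Databricks step I had in mind is genuinely small. A simplified sketch of what the notebook would do, promoting data through the medallion layers (the storage account, paths and column names are placeholders, not the real project code):

```python
# Simplified sketch of the proposed Databricks (DBX) job.
# ADF copies raw files into the landing zone; this job promotes them
# landing -> staging (Parquet) -> curated, which SQL / Power BI reads.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks the session already exists; getOrCreate() just returns it.
spark = SparkSession.builder.getOrCreate()

lake = "abfss://lake@<storage-account>.dfs.core.windows.net"  # placeholder

# Landing -> staging: keep the raw data untouched, rewrite as Parquet
# so downstream reads are cheap even if volume grows.
raw = spark.read.option("header", True).csv(f"{lake}/landing/sales/")
raw.write.mode("overwrite").parquet(f"{lake}/staging/sales/")

# Staging -> curated: apply the (admittedly simple) business logic.
staged = spark.read.parquet(f"{lake}/staging/sales/")
curated = (
    staged
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("amount").isNotNull())
)
curated.write.mode("overwrite").parquet(f"{lake}/curated/sales/")
```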
I recognize the need to cut costs wherever possible, so I recommended the minimum tier for each resource in the architecture. Despite my best efforts, the client's tech lead is not on board with the proposal.
He thinks Databricks is overkill for the project's data volume and prefers to use stored procedures.
He also thinks he only needs the curated layer, not landing and staging. I explained the benefits of the medallion structure, but he still doesn't want the extra layers; his reasoning is that the data is very simple and so are the transformations.
So my questions are:
- Given that I have low-volume data on Azure, should I consider using Databricks? The main constraint is cost.
- If Databricks is not recommended for low volume data on Azure, what are the alternative solutions?
I checked other Azure services like Function Apps and ADF, but cost-wise they seem to end up in the same place as DBX.
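For reference, the Function App variant I evaluated would look roughly like this: a timer-triggered Python function doing the same transform with pandas, since 90 MB fits in memory easily. The connection string, containers, file and column names here are made up for illustration:

```python
import io

import azure.functions as func
import pandas as pd
from azure.storage.blob import BlobServiceClient

app = func.FunctionApp()


@app.timer_trigger(schedule="0 0 2 * * *", arg_name="timer")  # nightly at 02:00
def transform(timer: func.TimerRequest) -> None:
    # Hypothetical names: adjust to your storage account / containers.
    service = BlobServiceClient.from_connection_string("<connection-string>")

    # Read the raw CSV from the landing container.
    landing = service.get_blob_client("landing", "sales/latest.csv")
    df = pd.read_csv(io.BytesIO(landing.download_blob().readall()))

    # The same simple business logic as the Databricks sketch.
    df["order_date"] = pd.to_datetime(df["order_date"])
    df = df[df["amount"].notna()]

    # Write the curated output as Parquet (requires pyarrow) for SQL / Power BI.
    out = io.BytesIO()
    df.to_parquet(out, index=False)
    curated = service.get_blob_client("curated", "sales/latest.parquet")
    curated.upload_blob(out.getvalue(), overwrite=True)
```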
- If I am not going to reuse the data, why should I implement a medallion architecture? (I am against leaving it out of the project, but I am open to suggestions.)
Note: I am not against the medallion model. I know we keep the landing data for future use cases, staging holds Parquet files for easy processing of larger volumes, and the curated layer holds the business logic. Please point out anything else I have missed here.