1

Objective

I little bit confused by terminology: I've built Data Lake (not DW) based on Kimball's data modeling approaches and now not sure if I can use Data Mart definition to name my MPP database layer.

I came from the assumption that you still need Dimensional Modeling and Star Schema for mid+ size organization reports, same reasoning as in this article.

Questions

  1. Is it right to call Synapse a Data Mart at the following architecture (see picture below)?
  2. Can I say that I don't have DW (even if I have Star Schema), but instead I have Data Lake + Data Mart(s)?
  3. Shall I split Synapse into multiple schemas based on business/reports sub-domains (multiple Data Marts)?

Architecture details

enter image description here

To be more specific, in my case:

2-3) ADLS + Databricks form Data Lake. All ETL and Star Schema build happens at Data Lake layer. All logic seats here. Still it has structured and unstructured data at raw layer, use cheap ADLS storage, lack Governance, has ML and will have streaming in the future. In other hand, we have schema-on-write in all DL zones except raw, we have tables modeled upfront (with a lot of requirements changes during the process). Am I right to call it Data Lake?

4.) Synapse serves as a tiny projection/model of ETL/Lake results in order to speed up reports response time. Almost zero logic here, few aggregations. Only final model is loaded to Synapse. Data are not splitted by business sub-domains, we just load everythin in a single DATAMART schema. Is that a good approach?

VB_
  • 45,112
  • 42
  • 145
  • 293

1 Answers1

1

Firstly, I wouldn't get too bogged down in definitions as there are loads of (slightly) different definitions of these terms. However, given that, I would give a high-level definition of these terms as follows:

  1. Data Lake: this is your source data loaded into data store where you can start to analyse it. It is normally structured in the same way as it is in the source systems (i.e. it is the "raw" data) plus, optionally, some auditing columns to show where the data came from, when it was loaded, etc. Some data lakes have multiple layers e.g. the raw data layer and then a governed data layer where the data has been cleansed, standardised, etc. - but is still in basically the same structure as in the raw data layer

  2. Data Warehouse: this is your Kimball model of all your fact and dimension tables (plus other tables such as bridges). It will be built from the data that exists in your data lake

  3. Data Mart: this is a subject area sourced from your data warehouse. This might be a logical definition (e.g. the Sales mart is the Sales fact table and related dimensions) or it might be physicalised e.g. a single wide table generated from a fact and its dimensions. How you define your datamarts is normally dependent on who/what is consuming them and what their requirements are. For example, you could have multiple Sales Data marts, all based off the same Sales Star, because you have multiple tools that prefer to consume data structured in particular ways

Hope this helps?

NickW
  • 8,430
  • 2
  • 6
  • 19
  • so due to your answer, Databricks = Data Lake + Data Warehouse, while Synapse = Data Mart(s), right? I think it's right, just little bit confused of having both at Databricks, Google usually proposes "DL vs DW" instead of "DL + DW" – VB_ Sep 15 '20 at 09:33
  • I'm confused because looks like many people consider Data Lake as the next step of Data Warehouse evolution. For example, https://databricks.com/discover/data-lakes/history. Does the article mean they treat Star Schema as Data Lake as well? – VB_ Sep 15 '20 at 10:48
  • okay, I think I've found a notion of merging DL and DW together: Databricks calls it Lakehouse https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html. It's more about DW-like features in DL, but I think it means that you acn go dimensional in DL and be DW. – VB_ Sep 15 '20 at 11:18
  • I'd focus on ensuring that your data storage solution meets your requirements rather than on how different people define different terms (bear in mind that they often have something to sell you!). If you have tools that work best with star schemas then you'll need to build star schemas; if you have tools that work best with raw data (e.g. AI/ML) then you'll need to make that available. As long as it all works you can call all these different areas whatever you want – NickW Sep 15 '20 at 12:06
  • sure Lakehouse or Data Hub are often used as rebrending or just mean that new set of features has been released. The reason why I bother about terminology - I need to understand Youtube high-level conferences and articles. – VB_ Sep 15 '20 at 12:20