7

I've been tasked with coming up with a recommendation on how to proceed with an EDW, and I'm looking for clarification on what I'm seeing. Everything I'm learning states that Kimball's approach will bring value to the business quicker than Inmon's. I get that Kimball's approach is a dimensional model from the get-go, and that the different data marts (star schemas) are integrated through conformed dimensions... thus the theory is that I can simply build the DM that solves the immediate business need and go on from there.
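To make sure I understand the conformed-dimension idea, here's a toy sketch (all table and column names are hypothetical, just for illustration) using Python's sqlite3:

```python
import sqlite3

# Toy sketch of conformed dimensions (all names hypothetical): two star
# schemas, sales and returns, share one dim_date, so both marts report
# against the same calendar.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY,
                           cal_date TEXT, cal_month TEXT);
    CREATE TABLE fact_sales   (date_key INTEGER, amount REAL);
    CREATE TABLE fact_returns (date_key INTEGER, amount REAL);
    INSERT INTO dim_date VALUES (20160101, '2016-01-01', '2016-01'),
                                (20160102, '2016-01-02', '2016-01');
    INSERT INTO fact_sales   VALUES (20160101, 100.0), (20160102, 250.0);
    INSERT INTO fact_returns VALUES (20160102, 40.0);
""")

# Each mart joins to the same conformed dimension, so monthly figures
# from the two marts line up on identical month labels.
sales = con.execute("""
    SELECT d.cal_month, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.cal_month
""").fetchone()
returns = con.execute("""
    SELECT d.cal_month, SUM(f.amount)
    FROM fact_returns f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.cal_month
""").fetchone()
# sales   -> ('2016-01', 350.0)
# returns -> ('2016-01', 40.0)
```

Because both fact tables key into the same dim_date, a report built from one mart can sit next to a report built from the other without reconciling two different calendars.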

What I'm learning states that Inmon's model suggests I have an EDW designed in 3NF. The EDW is not defined by the source systems but instead by the structure of the business, the Corporate Information Factory (Orders, HR, etc.). Data from disparate systems maps into this structure, and once the data is in this form, ETLs are created to produce DMs.

Personally, I feel Inmon's approach is the better way. I believe it will ensure that the data is consistent, and it feels like you can do more with it. What holds me back is that everything I'm reading says it will take much more time to deliver something, and I'm not seeing how that is true. From my (admittedly narrow) view, it feels like no matter what, the end result is that we need a DM. Whether we use Kimball's or Inmon's approach, the end result is the same.

So then the question becomes: how do we get there? In Kimball's approach we create ETLs to some staging location and generally from there create a DM. In Inmon's approach I feel we just add another layer: from the staging area we load the data into another database in 3NF, organized by business function. What I'm missing is how this step adds so much time.

I feel I can look at the end DM that needs to be made, map it back to a DW in 3NF, and then, as more DMs are requested, keep building up the 3NF DW with more and more data. If I create a DM using Kimball's model, however, that DM is built around the level of grain decided for it. What if the next DM requested wants reporting at an even deeper grain? To me it feels like Kimball's methodology would take more work there, while with Inmon's it doesn't matter: I have everything at the transactional level, so as DMs of varying grains are requested, I already have the data. Just ETL it to a DM, and all DMs will report the same since they are sourced from the same data.
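To illustrate the grain point with a toy example (hypothetical data, plain Python): from transaction-level records you can roll up to any coarser grain, but a mart stored only at a coarser grain can't be drilled back down.

```python
from collections import defaultdict

# Hypothetical transaction-grain records: (day, hour, amount).
txns = [
    ("2016-12-13", 9, 10.0),
    ("2016-12-13", 9, 5.0),
    ("2016-12-13", 14, 20.0),
    ("2016-12-14", 10, 7.5),
]

def rollup(rows, key):
    """Aggregate transaction-grain rows to any coarser grain."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)

# From the finest grain, both a daily and an hourly mart are derivable:
daily = rollup(txns, key=lambda r: r[0])
# -> {'2016-12-13': 35.0, '2016-12-14': 7.5}
hourly = rollup(txns, key=lambda r: (r[0], r[1]))
# -> three (day, hour) buckets

# But a mart that stored only `daily` cannot be drilled back down:
# there is no way to recover `hourly` from it.
```

That one-way street is exactly the asymmetry described above: the coarser the stored grain, the fewer future requests it can serve.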

I dunno... just looking for others' views. Everything I read says Kimball's is quicker. I say sure, maybe a little bit, but there is certainly a cost attributed to going the quicker route. And for the sake of argument, let's say it takes a week to get a DM up and running through Kimball's methodology; to me it feels like it should only take 10%, maybe 20%, longer using Inmon's.

If anyone has real-world experience with the different models, and whether one really takes so much longer than the other, please share. Or if I have this completely backwards, tell me that too!

user3776554
  • 1
    Not sure why this has been downvoted. Without a comment, I guess we'll never know. If you think the OP has made a mistake please tell him/her, so they can address it. – David Rushton Dec 13 '16 at 10:43
  • I would like to offer another approach which we use successfully: Data Vault 2.0 as the main DWH modelling technique. It is easy to understand, scalable and very flexible. Especially with a very complex enterprise DWH I would avoid a Kimball-only approach at all costs (it becomes very costly in a complex system environment) and implement another layer, which in our case is Data Vault based. – tobi6 Dec 15 '16 at 13:13

3 Answers

9

For context: I look after a 3-billion-record data warehouse for a large multi-national. Our data makes its way from the various source systems, through staging, and into a 3NF db. From there our ELT processes move the data into a dimensionally modelled, star schema db.

If I could start again I would definitely drop the 3NF step. When I first built that layer I thought it would add real value. I felt sure that normalisation would protect the integrity of my data. I was equally confident the 3NF db would be the best place to run large/complex queries.

But in practice it has slowed our development. Most changes require an update to the staging, 3NF and star schema dbs.
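A schematic sketch of why that hurts (the functions and column names here are hypothetical, not our real packages): each layer is a hand-off that has to be edited whenever anything upstream changes.

```python
# Schematic only (hypothetical names): each warehouse layer is a
# transform that must be updated when an upstream column is added.
def stage(source_row):
    return {"order_id": source_row["id"], "qty": source_row["qty"]}

def to_3nf(staged_row):
    return {"order_id": staged_row["order_id"], "qty": staged_row["qty"]}

def to_star(nf_row):
    return {"order_key": nf_row["order_id"], "quantity": nf_row["qty"]}

row = to_star(to_3nf(stage({"id": 1, "qty": 3})))
# Adding, say, a `discount` column to the source now means editing all
# three functions; without the 3NF layer, only two hand-offs change.
```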

The extra layer also increases the amount of time it takes to publish our data. The additional transformations, checks and reconciliations all add up.

The promised improvement in integrity never materialised. I realise now that because I control the ETL, and the validation processes within it, I can ensure my data is both denormalised and accurate. In reporting data we control every cell in every table. The more I think about that, the more I see it as a real opportunity.

Large and complex queries were another myth busted by experience. I now see the need to write complex reporting queries as a failing of my star db. When this occurs I always ask myself: why isn't this question easy to answer? The answer is most often bad table design. The heavy lifting is best carried out when transforming the data.

Running a 3NF and star also creates an opportunity for the two systems to disagree. When this happens it is often a very subtle difference. Neither is wrong, per se. Instead, it is possible the 3NF and star query are asking slightly different questions, and therefore returning different results. Although technically correct, this can be hard to explain. Even minor and explainable differences can erode confidence, over time.

In defence of our 3NF db, it does make loading into the star easier. But I would happily trade more complex SSIS packages for one less layer.

Having said all of this: it is very hard to recommend an approach to anyone without a deep understanding of their systems, requirements, culture, skills, etc. Having read your question I am sure you have wrestled with all these issues, and many more no doubt! In the end, only you can decide the best approach for your situation. Once you've made your mind up, stick with it. Consistency, clarity and a well-defined methodology are more important than anything else.

David Rushton
  • I was thinking about using Inmon's approach to isolate the population of our DMs from the source systems. Instead of one more sophisticated ETL from the source systems straight to Kimball's model, have two: one from the source systems to Inmon's 3NF, and then one from the 3NF to the DMs. As new source systems come online there is only one new ETL, into the 3NF; nothing changes from the 3NF to the DMs. However, the more I think about this... it does seem like it could take significantly more effort. – user3776554 Dec 13 '16 at 16:09
  • Thanks for your words @destination-data, this kind of practical experience is gold! – Nick.Mc Dec 13 '16 at 21:32
  • 1
    Thanks Nick. Like your answer too. You make some good points. User3776554 -> I should add, I can see some benefits to an intermediate layer, especially if it is only for internal use. If I did remove my 3NF, I would still need to maintain some tables (for whitelists, blacklists, acceptable value lookups, etc). But I could do all of that with far fewer tables than I have today. In this case I think less really is more! Good luck with your project. I'm sure you'll have a great reporting solution online in no time. – David Rushton Dec 15 '16 at 13:47
  • To add to my +1, this is an excellent answer that passes on practical experience – Rich Mar 10 '17 at 14:17
5

Dimensions and measures are a well proven method for presenting and simplifying data to end users.

If you present an end user with a schema based on the source system (3NF) versus a dimensionally modelled star schema (Kimball), they will be able to make much more sense of the dimensionally modelled one.

I've never really looked into an Inmon decision support system, but to me it seems to be just the ODS portion of a full data warehouse.

You are right in saying "The EDW is not defined by source system but instead the structure of the business". A star schema reflects this, but an ODS (a copy of the source system) doesn't.

A star schema takes longer to build than just an ODS, but it gives many benefits, including:

  • Slowly changing dimensions can track changes over time
  • Denormalisation simplifies joins and improves performance
  • Surrogate keys allow you to disconnect from source systems
  • Conformed dimensions let you report across business units (e.g. profit per headcount)
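As a rough sketch of the first and third bullets (the schema and names here are hypothetical), a Type 2 slowly changing dimension closes the current row and opens a new one with a fresh surrogate key, while the business key stays the same:

```python
from datetime import date

# Minimal Type 2 SCD sketch (hypothetical schema): each change to a
# customer closes the current row and opens a new one with a fresh
# surrogate key; the business key (customer_id) never changes.
dim_customer = []
_next_key = [1]

def scd2_upsert(customer_id, city, as_of):
    current = next((r for r in dim_customer
                    if r["customer_id"] == customer_id
                    and r["valid_to"] is None), None)
    if current and current["city"] == city:
        return current["surrogate_key"]      # no change, reuse the row
    if current:
        current["valid_to"] = as_of          # close the old version
    row = {"surrogate_key": _next_key[0], "customer_id": customer_id,
           "city": city, "valid_from": as_of, "valid_to": None}
    _next_key[0] += 1
    dim_customer.append(row)
    return row["surrogate_key"]

k1 = scd2_upsert(42, "London", date(2016, 1, 1))  # -> surrogate key 1
k2 = scd2_upsert(42, "Paris", date(2016, 6, 1))   # -> surrogate key 2
# Facts loaded before June keep surrogate key 1, so history is
# preserved, and the keys are fully decoupled from the source system.
```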

If your Inmon 3NF database is not just an ODS (replica of source systems), but some kind of actual business model then you have two layers to model: the 3NF layer and the star schema layer.

It's difficult nowadays to sell the benefit of even one layer of data modelling, when everyone thinks they can just do it all in a 'self service' tool (which I believe is a fallacy). Your system should be no more complicated than it needs to be, because all that complexity adds up to maintenance, and that's the real issue: introducing changes 12 months into the build when you have to change many layers.

To paraphrase @destination-data: your source-system-to-star-schema transformation (and separation) is already achieved through ETL, so the 3NF layer seems redundant to me. You design your star schema to be independent from the source systems by correctly implementing surrogate keys and business keys, and by modelling it on the business, not on the source system.

Nick.Mc
  • I'm not necessarily thinking of exposing the 3NF to the users. What I was attempting to do was isolate the population of the DMs from our source systems (especially if they seem to change a lot). My thinking was that having ETLs run from Inmon's 3NF would abstract that layer away (those ETLs should never change once set up). A new system comes online... just map it to the 3NF. However, the more I think about this, I'm seeing all the additional steps that need to be maintained, and weighing the benefit against the amount of work involved... so... – user3776554 Dec 13 '16 at 16:05
2

With ETL and back-end data wrangling taking up about 70% of the project time for this kind of endeavour, an extra layer makes a big difference. It's an extra layer to transform from source to target, to agree with the business, and to test. It all adds up.

Whilst I'm not saying that dimensional models (the Kimball kind) are always easy to change, you've got a lot more inflexibility if you always have to change lots of layers whenever you want to change your BI.

In fact, the places where I've consulted whose data warehouses are considered inflexible, expensive to develop for, and unable to keep pace with changes to the business have, without exception, included a 3NF layer prior to the DMs. As Nick mentioned, it is hard nowadays to sell the idea of a 'proper' data warehouse as opposed to a data discovery BI tool, and the appeal of these is often driven by DWs being seen as slow and expensive to develop.

Kimball isn't against having a 3NF layer prior to his DW if it makes sense for a situation; he just doesn't agree with Inmon that there's always a point to it.

One common misunderstanding is that Kimball proposes distinct data marts that you'd have to change each time there is a different reporting request. Instead, Kimball's DMs are based on real-life business processes and modelled accordingly. Although it's true you will then try to make them suitable for reporting, you make them so they can answer foreseeable queries. You don't aggregate and store just the aggregates: you work with the transactional data in a Kimball dimensional model. So there's no need to be reluctant from that perspective.
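A quick sketch of that last point (table and column names are hypothetical), again with Python's sqlite3: a Kimball fact table kept at transaction grain serves both detailed and summary requests from the same rows.

```python
import sqlite3

# Sketch (hypothetical names): a fact table at transaction grain can
# answer both the detail question and the summary question.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact_orders (order_id INTEGER, product_key INTEGER,
                              amount REAL);
    INSERT INTO fact_orders VALUES (1, 10, 10.0), (2, 10, 10.0),
                                   (3, 20, 25.0);
""")

# Detail: the individual orders are still there to query.
detail = con.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]

# Summary: any coarser request is just a GROUP BY over the same rows.
summary = con.execute("""
    SELECT product_key, SUM(amount)
    FROM fact_orders
    GROUP BY product_key
    ORDER BY product_key
""").fetchall()
# detail  -> 3
# summary -> [(10, 20.0), (20, 25.0)]
```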

If an ODS works for you, then go for it, but a Kimball DW will meet the majority of requirements.

Rich