
I'm reading Kimball's data warehouse book at the moment and, at the same time, I'm designing a data warehouse for the following example:

1: A client can buy products, which generates an invoice, which in turn generates an inventory row. Note: the inventory row is an accumulating total, i.e. if they order Product A with a quantity of 5, and then another 5, there would be 2 invoice lines but one inventory row of 10 units.

So based on the scenario described, the following design would be appropriate:

1. Client dimension
2. Product dimension
3. Date dimension
4. Invoice fact
5. Inventory fact
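
For concreteness, here is a minimal sketch of what that schema might look like. All table and column names are illustrative assumptions, chosen to line up with the queries in the answers below:

-- Hypothetical star schema for the scenario above (names are illustrative).
CREATE TABLE DimClient (
    ClientKey  INT PRIMARY KEY,
    ClientName VARCHAR(100)
);

CREATE TABLE DimProduct (
    ProductID   INT PRIMARY KEY,
    ProductName VARCHAR(100)
);

CREATE TABLE DimDate (
    DateKey  INT PRIMARY KEY,  -- e.g. 20190107
    FullDate DATE
);

-- Transaction grain: one row per invoice line.
CREATE TABLE FactInvoice (
    ClientKey   INT REFERENCES DimClient (ClientKey),
    ProductID   INT REFERENCES DimProduct (ProductID),
    InvoiceDate DATE,
    Quantity    INT,
    Amount      DECIMAL(18, 2)
);

-- Accumulating grain: one row per client/product holding the running total.
CREATE TABLE FactInventory (
    ClientKey      INT REFERENCES DimClient (ClientKey),
    ProductID      INT REFERENCES DimProduct (ProductID),
    InventoryLevel INT
);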

Now, let's say I want to answer questions like the following (there will be other similar questions as well):

1. Which clients currently have zero inventory but have generated invoices in the last 3 months?
2. Which clients have product A but don't have product B?

Based on Kimball's book, it seems like there would need to be some sort of join between the two fact tables, like a subset?

This seems rather inefficient, especially given the sheer number of records in the two fact tables. Is this the right approach?

user172839
  • The only relational data model that would be more 'efficient' is if you pre-calculated the values into a special table. There are always performance issues to be overcome in a relational database. A star schema is often faster than OLTP because data is pre-processed to suit reporting. – Nick.Mc Jan 06 '19 at 23:07
  • This model is supposed to be a star schema, but the queries I'm looking to run need access to both fact tables. – user172839 Jan 06 '19 at 23:33
  • Yes, that's normal in a star schema. You use conformed (common) dimensions to report across multiple facts. – Nick.Mc Jan 06 '19 at 23:53
  • The crucial design part is the *partitioning* of both fact tables, which could enable a) effective access over a time interval and b) efficient (partition-wise) joins between the fact tables. – Marmite Bomber Jan 07 '19 at 09:42

2 Answers


Which clients currently have zero inventory but have generated invoices in the last 3 months?

SELECT C.* 
FROM DimClient C
WHERE EXISTS (
  SELECT * FROM FactInvoice I 
  WHERE C.ClientKey = I.ClientKey
  AND I.InvoiceDate >= DATEADD(m, -3, GETDATE())
  )
AND NOT EXISTS (
    SELECT * FROM FactInventory V 
    WHERE InventoryLevel > 0 
    AND V.ClientKey = C.ClientKey
    -- plus, perhaps, a predicate to limit to current inventory only
    )
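
As an aside on that last placeholder: if FactInventory were a periodic snapshot (one row per client/product per snapshot date) rather than a single current row, the "current inventory only" predicate might look something like this. SnapshotDateKey is an assumed column, not one from the original design:

    AND V.SnapshotDateKey = (
      SELECT MAX(V2.SnapshotDateKey)  -- most recent snapshot for this client
      FROM FactInventory V2
      WHERE V2.ClientKey = V.ClientKey
      )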

Which clients have product A but don't have product B?

SELECT C.* 
FROM DimClient C
WHERE EXISTS (
  SELECT * FROM FactInvoice I 
  WHERE C.ClientKey = I.ClientKey
  AND I.ProductID IN (1,2,3)  -- IDs standing in for product A
  )
AND NOT EXISTS (
  SELECT * FROM FactInvoice I 
  WHERE C.ClientKey = I.ClientKey
  AND I.ProductID IN (4,5,6)  -- IDs standing in for product B
  )
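
For comparison, here is a sketch of the same "A but not B" question using conditional aggregation instead of EXISTS, keeping the same hypothetical ProductID sets; join the result back to DimClient if you need client attributes:

SELECT I.ClientKey
FROM FactInvoice I
GROUP BY I.ClientKey
HAVING SUM(CASE WHEN I.ProductID IN (1,2,3) THEN 1 ELSE 0 END) > 0  -- bought product A
   AND SUM(CASE WHEN I.ProductID IN (4,5,6) THEN 1 ELSE 0 END) = 0  -- never bought product B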

Are these queries inefficient? No more inefficient than the same queries run against the source system.

In fact, your source system will probably need more tables and more joins, and will probably have fewer indexes.

One of the premises of star schemas is that they are optimised for analytics.

Nick.Mc
  • Sorry, I'm not quite understanding this. I'll be running these queries through the data warehousing system and not as SQL queries. – user172839 Jan 07 '19 at 21:06
  • Are you saying you are using some kind of client tool? Which one? These types of second-order questions can be tricky to answer in most ad-hoc browsing tools. – Nick.Mc Jan 07 '19 at 21:44
  • I'm looking to answer these queries using a data warehousing system, i.e. SSAS. – user172839 Jan 07 '19 at 21:53
  • In SSAS you use DAX or MDX to answer those questions instead of SQL – Nick.Mc Jan 07 '19 at 21:55
  • Thanks. With the current setup, it doesn't seem like using DAX/MDX would be any more efficient than performing a simple SQL query? – user172839 Jan 07 '19 at 21:56
  • I don't know what you mean by "more efficient". This question is now officially going around in circles – Nick.Mc Jan 07 '19 at 23:30
  • OK, what I'm trying to say is that these types of queries aren't that efficient in SQL when dealing with large amounts of data. The example I gave is just a simple scenario, but you may need to analyse more complicated queries such as "which clients have zero inventory, have bought in the last 3 months and have bought product B", etc. To perform such a query in SQL would be quite slow. Using DAX to query two fact tables would seem to be no better? – user172839 Jan 08 '19 at 00:07
  • By efficiency do you mean performance? Add some RAM and indexes? Or do you mean efficiency as in it takes a specialist analyst to come up with and write the query? That's required for any non-trivial question. I call these queries 'second-order' because they aren't just simple summaries. With questions like this you normally need to go back to the asker and deal with ambiguities in the original question. It's difficult to describe these in non-technical language. – Nick.Mc Jan 08 '19 at 00:40
  • Maybe you're saying the data model doesn't seem to be "efficient". For analysis, star schemas are an improvement over fully normalised schemas. If you have a complex question that is frequently asked, you create a 'summary' table and pre-calc the data in it. The problem is you can't pre-calc this kind of thing if, for example, the basket changes, because the combinations explode. – Nick.Mc Jan 08 '19 at 00:42
  • In actual fact DAX is really good at this kind of thing because it can recognise the filters being used by the end-user tool and calc off that. For example, a user could filter on particular items in Power BI, and you can write DAX that recognises that filter in real time and calcs further analysis off it. I'll try and find a reference. DAX is faster because it's compressed and in memory, but in _theory_ you can get the same result in SQL with columnstores and lots of RAM. – Nick.Mc Jan 08 '19 at 00:44
  • Here's a sample of basket analysis where you pick two sets of products. NOTE: This is based off a..... star schema https://www.daxpatterns.com/basket-analysis/ – Nick.Mc Jan 08 '19 at 00:46
  • I notice you have a _lot_ of questions answered in great detail where you have not selected an answer. Please go back and consider ticking answers if they helped – Nick.Mc Feb 03 '19 at 12:17

According to Kimball, that seems to be the right approach; it is what he calls drilling across. He mentions that for this to work correctly, the fact tables need to be set up to use conformed dimensions. I'm not able to speak to how efficient or inefficient that would be in this case, though.
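
For reference, here's a minimal sketch of that drill-across pattern, reusing the hypothetical names from the question and the other answer (Amount is an assumed measure column): aggregate each fact table separately to the grain of the conformed Client dimension, then join the small result sets rather than joining the raw fact tables to each other.

WITH InvoiceTotals AS (
  SELECT ClientKey, SUM(Amount) AS TotalInvoiced
  FROM FactInvoice
  GROUP BY ClientKey
),
InventoryTotals AS (
  SELECT ClientKey, SUM(InventoryLevel) AS TotalOnHand
  FROM FactInventory
  GROUP BY ClientKey
)
SELECT C.ClientKey,
       COALESCE(IV.TotalInvoiced, 0) AS TotalInvoiced,  -- zero if no invoices
       COALESCE(NV.TotalOnHand, 0)   AS TotalOnHand     -- zero if no inventory
FROM DimClient C
LEFT JOIN InvoiceTotals IV ON IV.ClientKey = C.ClientKey
LEFT JOIN InventoryTotals NV ON NV.ClientKey = C.ClientKey;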

Marc0