
Hi, I am doing a project on a data warehouse and I am not sure whether I am modeling it correctly. My data warehouse is not based on a business process, so I have found very little information about this kind of case.

Basically I have a lot of library files, and each library file contains many cells, each cell contains many pins, and each pin contains timing and power information. The different library files basically contain the same cells and pins; only the timing/power information is different.

library -> cell -> pin -> timing/power

I am interested in the cell properties (timing/power) so that I can make comparisons later.

Should I model my warehouse as a snowflake schema, where my fact table contains only foreign keys to the library dimension and the date dimension? The library dimension would then be further divided into a cell dimension, the cell dimension into a pin dimension, and the pin dimension into timing and power dimensions.

Or as a star schema, where my fact table contains foreign keys to the library, cell, pin, timing, power and date dimensions?
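
To make this concrete, here is a rough sketch of the two fact-table shapes I am considering (all table and column names here are just placeholders I made up):

    -- Snowflake idea: the fact table references only library and date;
    -- cell, pin, timing and power hang off the library dimension in a chain
    -- (dim_library <- dim_cell <- dim_pin <- dim_timing / dim_power).
    CREATE TABLE fact_measurement_snowflake (
        library_key INT NOT NULL,   -- FK to dim_library
        date_key    INT NOT NULL    -- FK to dim_date
    );

    -- Star idea: the fact table references every dimension directly.
    CREATE TABLE fact_measurement_star (
        library_key INT NOT NULL,   -- FK to dim_library
        cell_key    INT NOT NULL,   -- FK to dim_cell
        pin_key     INT NOT NULL,   -- FK to dim_pin
        timing_key  INT NOT NULL,   -- FK to dim_timing
        power_key   INT NOT NULL,   -- FK to dim_power
        date_key    INT NOT NULL    -- FK to dim_date
    );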

My concern is that my data is very large: I have around 200 library files, each library file contains around 20,000 cells, each cell contains a few pins, and each pin contains a few timing and power entries. The total size is therefore roughly 200 x 20,000 x 4 x 4 = 64 million rows.

I will be loading this large set of data repeatedly, whenever a new version of the library files is released.

Can you give me a suggestion on which one is better?

Edit:

Library A
   Cell A
      Pin A1
        Condition A11
           riseTimingTemplate
           fallTimingTemplate
           risePowerTemplate
           fallPowerTemplate

The hierarchy is shown above. Different libraries contain the same cells, pins and conditions; only the timing and power templates are different.

Let's say my fact grain is the timing and power values of a particular cell. My dimensions would then be library, cell, pin, condition, riseTimingTemplate, fallTimingTemplate, risePowerTemplate and fallPowerTemplate, all linked to the fact table. Is that correct?
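
In SQL terms, what I have in mind looks roughly like this (the names are placeholders, not a finished design):

    -- One row per cell/pin/condition in a given library release.
    -- Whether the actual timing/power numbers live here as measures or as
    -- attributes of the template dimensions is part of what I am asking.
    CREATE TABLE fact_cell_timing_power (
        library_key              INT NOT NULL,  -- FK to dim_library
        cell_key                 INT NOT NULL,  -- FK to dim_cell
        pin_key                  INT NOT NULL,  -- FK to dim_pin
        condition_key            INT NOT NULL,  -- FK to dim_condition
        rise_timing_template_key INT NOT NULL,  -- FK to dim_rise_timing_template
        fall_timing_template_key INT NOT NULL,  -- FK to dim_fall_timing_template
        rise_power_template_key  INT NOT NULL,  -- FK to dim_rise_power_template
        fall_power_template_key  INT NOT NULL,  -- FK to dim_fall_power_template
        date_key                 INT NOT NULL   -- FK to dim_date
    );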

xvi_16

1 Answer


This will depend heavily on the database technology you are using, your reporting requirements, and your ETL performance objectives; here are some thoughts to consider based on your situation above. The most direct part of the answer to your question is in the OLAP section below.

Inmon:
For space efficiency, ETL speed, and design simplicity, consider using an Inmon design, which is entity based. For your example, this might be one table for each entity, with the facts stored directly on the table and the hierarchy linked by simple key/foreign-key relationships between the tables. For example, you might have four tables as you mention above. Link these tables with natural keys to avoid dependencies on sort order or other arbitrary surrogate keys, in case you need to compare previously loaded data with the current load. This can also be more space efficient for facts where time is not smooth. However, this is a tradeoff: if a time series or other series must be smooth, reporting may be more difficult because you have to work out any smoothing requirements yourself.
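
As a rough sketch only (names and types are illustrative, not prescriptive), an Inmon-style layout for your hierarchy could store the measures directly on the lowest entity and link the levels with natural keys:

    -- Entity tables linked by natural keys; measures live on the entity rows.
    CREATE TABLE library (
        library_name VARCHAR(100) PRIMARY KEY,   -- natural key
        release_date DATE
    );

    CREATE TABLE cell (
        library_name VARCHAR(100) NOT NULL REFERENCES library (library_name),
        cell_name    VARCHAR(100) NOT NULL,
        PRIMARY KEY (library_name, cell_name)
    );

    CREATE TABLE pin (
        library_name VARCHAR(100) NOT NULL,
        cell_name    VARCHAR(100) NOT NULL,
        pin_name     VARCHAR(100) NOT NULL,
        rise_timing  DECIMAL(18,6),              -- facts stored on the entity
        fall_timing  DECIMAL(18,6),
        rise_power   DECIMAL(18,6),
        fall_power   DECIMAL(18,6),
        PRIMARY KEY (library_name, cell_name, pin_name),
        FOREIGN KEY (library_name, cell_name) REFERENCES cell (library_name, cell_name)
    );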

Kimball:
A Kimball model is a good fit when there are many different facts but all at the same grain. Mixing facts at different grains makes the data model more complex to build and use. In your example I would treat facts at the cell level and facts at the pin/power level as different grains, assuming there is a 1-n relationship between those tables. Where you might simply store measurements on the entity in an Inmon design, in a Kimball design you would normally separate facts from dimensions, which creates additional tables to hold these measures.
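
As a minimal sketch (again with made-up names), a Kimball layout keeps one fact table per grain and moves the descriptive attributes into dimension tables with surrogate keys:

    CREATE TABLE dim_library (library_key INT PRIMARY KEY, library_name VARCHAR(100));
    CREATE TABLE dim_cell    (cell_key    INT PRIMARY KEY, cell_name    VARCHAR(100));
    CREATE TABLE dim_pin     (pin_key     INT PRIMARY KEY, pin_name     VARCHAR(100));
    CREATE TABLE dim_date    (date_key    INT PRIMARY KEY);

    -- One fact table, one grain: pin-level timing/power per library load.
    CREATE TABLE fact_pin_timing_power (
        library_key INT NOT NULL REFERENCES dim_library (library_key),
        cell_key    INT NOT NULL REFERENCES dim_cell (cell_key),
        pin_key     INT NOT NULL REFERENCES dim_pin (pin_key),
        date_key    INT NOT NULL REFERENCES dim_date (date_key),
        rise_timing DECIMAL(18,6),
        fall_timing DECIMAL(18,6),
        rise_power  DECIMAL(18,6),
        fall_power  DECIMAL(18,6)
    );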

OLAP:
If you are using OLAP technologies for your query engine, this is a little more complex. Most will require the data model to be a star schema. Most engines won't allow a snowflake definition because it risks a 'product join', which happens when you have n-1 or n-n relationships between dependent dimension tables. N-n dimension tables can produce duplicate fact rows if not carefully handled, and these are very hard to troubleshoot: queries come back with values that are too high simply because those n-n dimensions were joined in. If you can guarantee the outer dimensions are always 1-1 or 1-n (meaning one library to x cells, but never two or more libraries to a cell) and your measures are recorded at the timing/power level, then a snowflake is a great way to store your data (least space), but you may still need to build 'views' so your design appears to be a star to your OLAP toolset (a few newer OLAP engines will allow these designs directly). Building your data as a star will take more space but may allow you to spot n-n or n-1 incidents more easily.
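
One way to get both (sketched here with hypothetical names, and assuming each snowflaked level stores its parent's key) is to store the snowflake but expose a flattened view so the OLAP tool still sees a star:

    -- Snowflaked dimensions: each level carries the key of its parent.
    CREATE TABLE dim_library (library_key INT PRIMARY KEY, library_name VARCHAR(100));
    CREATE TABLE dim_cell    (cell_key INT PRIMARY KEY, cell_name VARCHAR(100),
                              library_key INT NOT NULL REFERENCES dim_library (library_key));
    CREATE TABLE dim_pin     (pin_key INT PRIMARY KEY, pin_name VARCHAR(100),
                              cell_key INT NOT NULL REFERENCES dim_cell (cell_key));

    -- Flatten the chain behind a view so the fact table plus this view look
    -- like a plain star to the query/OLAP layer.
    CREATE VIEW dim_pin_flat AS
    SELECT p.pin_key, p.pin_name,
           c.cell_key, c.cell_name,
           l.library_key, l.library_name
    FROM   dim_pin     p
    JOIN   dim_cell    c ON c.cell_key    = p.cell_key
    JOIN   dim_library l ON l.library_key = c.library_key;

This only stays safe under the 1-n guarantee above; if a pin could belong to two cells, the view itself would start multiplying rows.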

Keep in mind that you have to keep the lowest-level key in the fact table if you ever plan to report on the smaller grain (in this case the key structure you represent in the star; your snowflake will require the same key structure). It is just a matter of space efficiency: you avoid duplicating the library information on each cell as you would in a star.

[Edit from comments] Data model: based on the comments and your supplied model, a few thoughts:

  • The fact table's foreign keys are also its primary key (see the sketch after this list).
  • Your model says you are not collecting facts on pins or templates, which makes them purely informational. Consider leaving them out if you have many pins/templates per cell, as they will introduce a product join that can create duplicate facts when those tables are joined. However, keep them if you might ask 'given some unusual measures, how are my cells related by common pins?' or similar types of analysis.
  • The design indicates libraries relate to cells only through the facts collected, making that relationship (library to cell) dynamic.
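
To illustrate the first point, here is a sketch of the fact table (hypothetical names; pin and template keys left out, per the second point) where the dimension foreign keys double as the composite primary key:

    CREATE TABLE fact_cell_timing_power (
        library_key   INT NOT NULL,   -- FK to dim_library
        cell_key      INT NOT NULL,   -- FK to dim_cell
        condition_key INT NOT NULL,   -- FK to dim_condition
        date_key      INT NOT NULL,   -- FK to dim_date
        rise_timing   DECIMAL(18,6),
        fall_timing   DECIMAL(18,6),
        rise_power    DECIMAL(18,6),
        fall_power    DECIMAL(18,6),
        -- the foreign keys together are also the primary key
        PRIMARY KEY (library_key, cell_key, condition_key, date_key)
    );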

Lastly, a tip for the date dimension: the primary key for date can be a formatted number based on the detail level of the date/time collection (e.g. 20150101 might be the key for Jan 1 2015). Add a second table for time if you need time of day; otherwise the date table will grow in storage very quickly. This makes the ETL build faster, and you can even skip joins to the date table for basic date data.
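
A sketch of what that can look like (column names are only suggestions):

    -- Date key as a readable integer, e.g. 20150101 = Jan 1 2015.
    CREATE TABLE dim_date (
        date_key       INT PRIMARY KEY,   -- YYYYMMDD
        full_date      DATE NOT NULL,
        calendar_year  INT NOT NULL,
        calendar_month INT NOT NULL,
        calendar_day   INT NOT NULL
    );

    -- Keep time-of-day in its own small table only if you need it, so the
    -- date table does not explode in size.
    CREATE TABLE dim_time (
        time_key         INT PRIMARY KEY,  -- HHMMSS
        hour_of_day      INT NOT NULL,
        minute_of_hour   INT NOT NULL,
        second_of_minute INT NOT NULL
    );

    -- Building the key during ETL:
    -- date_key = EXTRACT(YEAR FROM load_date) * 10000
    --          + EXTRACT(MONTH FROM load_date) * 100
    --          + EXTRACT(DAY FROM load_date)

Because the key is readable, a filter like WHERE date_key BETWEEN 20150101 AND 20150131 does not even need the join to dim_date.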

Tim Cederquist
  • I will create my own simple query engine. So whether I use a snowflake or not, I will have the 8 dimension tables mentioned above (in my edited question)? Since 2+ libraries will match the same cell, I will have a lot of redundancy; is there any way to avoid it? – xvi_16 Dec 26 '14 at 09:06
  • When you say 2+ libraries will match a cell, do you mean lib1 links to cell2 and cell3, or does cell2 have links to lib1 and lib2? I'll update above once confirmed. Also, do rising and falling match each other semantically? Those might just be one fact table if those measures relate to each other; even if not, they can be four sets of fact/measure columns. – Tim Cederquist Dec 26 '14 at 15:56
  • If lib1 contains cellA, cellB and cellC, then lib2 also contains cellA, cellB and cellC, and the same goes for the pins; what is different is the timing/power information. The library is characterized by the operating condition: in a different operating condition (library) we have the same cells and same pins, but different riseTiming, fallTiming, risePower and fallPower info. Also, should I use a factless fact table, where my timing/power info is stored in a dimension rather than in the fact? – xvi_16 Dec 26 '14 at 17:49
  • Sorry for all the questions, but I am starting to understand: Cell and Pin are entities/dimensions shared by libraries. A library brings these together, and you record facts about their behavior in a library. If this is true, I'd suggest cell and pin are outer dimensions for a snowflake. Library is the inner dimension, and facts are recorded in a fact table linked to a library and perhaps a time dimension? (throwing in a new dimension, time) You would be reporting on a library's performance, with the ability to get attributes about the library by drilling to cells/pins. Can you share the expected keys? – Tim Cederquist Dec 26 '14 at 21:52
  • What are inner and outer dimensions? My grain would be the cell performance (timing/power); I would like to measure the cell performance across different operating conditions (libraries) and time (the next release of the library file). http://postimg.org/image/p1nl1lkr7/ This is my current thought... do you think it is legit? – xvi_16 Dec 27 '14 at 06:36
  • Inner is just a term I use for dimensions directly attached to facts; outer are dimensions attached to other dimensions. Yes, the design seems legit, and I have updated the comments above based on your response. – Tim Cederquist Dec 27 '14 at 07:54