4

I'm building a dimensional data warehouse and learning how to model my various business processes from my source system in my warehouse.

I'm currently modelling a "Bid" (bid for work) from our source system in our data warehouse as a fact table which contains information such as:

  • Bid amount
  • Projected revenue
  • Sales employee
  • Bid status (active, pending, rejected, etc)
  • etc.

The problem is that the bid (or most any other process I'm trying to model) can go through various states and have its information updated at any given moment in the source system. According to Ralph Kimball, fact tables should only be updated if they are considered "accumulating snapshots," and I'm sure that not all of these processes would be considered accumulating snapshots by the definition below.

How should these types of processes be modeled in the data warehouse according to the Kimball group's recommendations? Furthermore, what type of fact table would work for a bid (given the facts I've outlined above)?

Excerpt from http://www.kimballgroup.com/2008/11/fact-tables/

The transaction grain corresponds to a measurement taken at a single instant. The grocery store beep is a transaction grain. The measured facts are valid only for that instant and for that event. The next measurement event could happen one millisecond later or next month or never. Thus, transaction grain fact tables are unpredictably sparse or dense. We have no guarantee that all the possible foreign keys will be represented. Transaction grain fact tables can be enormous, with the largest containing many billions of records.

The periodic snapshot grain corresponds to a predefined span of time, often a financial reporting period. Figure 1 illustrates a monthly account periodic snapshot. The measured facts summarize activity during or at the end of the time span. The periodic snapshot grain carries a powerful guarantee that all of the reporting entities (such as the bank account in Figure 1) will appear in each snapshot, even if there is no activity. The periodic snapshot is predictably dense, and applications can rely on combinations of keys always being present. Periodic snapshot fact tables can also get large. A bank with 20 million accounts and a 10-year history would have 2.4 billion records in the monthly account periodic snapshot!

The accumulating snapshot fact table corresponds to a predictable process that has a well-defined beginning and end. Order processing, claims processing, service call resolution and college admissions are typical candidates. The grain of an accumulating snapshot for order processing, for example, is usually the line item on the order. Notice in Figure 1 that there are multiple dates representing the standard scenario that an order undergoes. Accumulating snapshot records are revisited and overwritten as the process progresses through its steps from beginning to end. Accumulating snapshot fact tables generally are much smaller than the other two types because of this overwriting strategy.

Ryan Griffith
  • 1,591
  • 17
  • 41
  • You have actually answered your own question. What you've currently made is a pure snapshot. If you want to monitor the bidding process, the grain should be every bid, not just the end state at load time. – fenix Oct 14 '14 at 02:17
  • @fenix, let me alter the question slightly then: if I were to be modeling a line item (say an order item), that would lend itself to be considered a "transaction" fact which does not advocate updating, however the line items for orders within my system can be updated. How would one deal with that? – Ryan Griffith Oct 14 '14 at 02:35
  • It's called Change Data Capture. There are a few easy techniques to implement it. – fenix Oct 14 '14 at 02:37
  • I understand how to technically deal with the issue, however I do not understand how to appropriately deal with the change in the context of Ralph Kimball's dimensional approach. – Ryan Griffith Oct 14 '14 at 02:41
  • You've misunderstood something. Even transaction fact tables have some updates, but the number of rows being updated is almost zero compared to the total number of rows in the fact table. If you want to track every change, for example a bad input (quantity changed from 3 to 5), you can insert a delta row (Fk,Fk,Fk,Fk,+2), but you will lose grain uniqueness. – fenix Oct 14 '14 at 13:48
  • You have 2 choices, depending on what you want to store in your data warehouse. 1) Insert only final accepted bids. 2) Insert row(s) every time the status of a bid changes. In other words, insert on every status update. This is a much finer granularity, but if it meets your needs, it meets your needs. – Gilbert Le Blanc Dec 28 '15 at 15:47
  • This question is a good example of how Kimball's star schemas fail when it comes to anything that is not transactional in nature. He has "accumulating snapshots" but these are huge pains to handle. Or you invent transactions in your warehouse (essentially creating transactions for changes). If your data is point-in time, never updating, stars are great. Otherwise, might I suggest a real 5NF enterprise relational design? – N West Feb 16 '16 at 21:16

1 Answer

2

As one of the comments mentions, Change Data Capture is a fairly generic term for "how do I handle changes to data entities over time," and there are entire books on it (and a gazillion posts and articles).

Regardless of any statements that seem to suggest a clear black-and-white, always-do-it-like-this answer, the real answer, as usual, is "it depends" - in your case, on what grain you need for your particular fact table.

If your data changes in unpredictable ways or very often, it can become challenging to implement Kimball's version of an accumulating snapshot (picture how many "milestone" date columns you might end up needing).
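To make the accumulating-snapshot trade-off concrete, here is a minimal sketch of what such a table might look like for a Bid, using sqlite3 so it is runnable. The table and column names (fact_bid_pipeline, submitted_date, etc.) and the milestone set are invented for illustration - a real bid process could need many more milestone columns, which is exactly the challenge described above.

```python
import sqlite3

# Hypothetical accumulating-snapshot layout: ONE row per bid, revisited
# and overwritten as the process passes each milestone.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_bid_pipeline (
    bid_key           INTEGER PRIMARY KEY,
    sales_emp_key     INTEGER,
    submitted_date    TEXT,   -- milestone 1
    reviewed_date     TEXT,   -- milestone 2
    decided_date      TEXT,   -- milestone 3 (accepted/rejected)
    bid_amount        REAL,
    projected_revenue REAL,
    bid_status        TEXT
);
""")

# Initial load when the bid is submitted; later milestones are still unknown.
conn.execute(
    "INSERT INTO fact_bid_pipeline VALUES "
    "(1, 42, '2014-10-01', NULL, NULL, 5000.0, 1200.0, 'active')"
)

# As the process advances, the SAME row is overwritten - the hallmark of
# an accumulating snapshot.
conn.execute("""
UPDATE fact_bid_pipeline
SET reviewed_date = '2014-10-05', bid_status = 'pending'
WHERE bid_key = 1
""")

row = conn.execute(
    "SELECT reviewed_date, bid_status FROM fact_bid_pipeline"
).fetchone()
print(row)  # ('2014-10-05', 'pending')
```

Every additional step in the real-world process adds another milestone column here, and every update must find and rewrite the existing row - which is why unpredictable processes strain this design.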

So, if you prefer, you can decide to make your fact table a transactional fact table rather than a snapshot, where the fact key would be (Bid Key, Timestamp). Then, in your application layer (whether a view, mview, actual app, or whatever), you can ensure that a given query only gets the latest version of each Bid (this can be thought of as a kind of virtual accumulating snapshot). If you find that you don't need the previous versions (the history of each Bid), you can have a routine that prunes them (i.e., deletes or moves them somewhere else).

Alternatively, you can allow the fact (Bid) to be added only when it is in its final state, but then you will likely have a significant lag where a new (still updateable) Bid doesn't make it to the fact table for some time.
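The "final state only" approach boils down to a publish filter in the ETL: stage every change, but load a bid into the fact table only once it reaches a terminal status. The status names here are assumptions about the bid workflow.

```python
# Terminal statuses after which a bid can no longer change (assumed).
FINAL_STATUSES = {"accepted", "rejected"}

def rows_to_publish(staged_bids):
    """Keep only bids whose process has ended and are safe to load."""
    return [b for b in staged_bids if b["status"] in FINAL_STATUSES]

staged = [
    {"bid_key": 1, "status": "accepted", "amount": 5500.0},
    {"bid_key": 2, "status": "active",   "amount": 9000.0},  # still mutable
]
print(rows_to_publish(staged))
# [{'bid_key': 1, 'status': 'accepted', 'amount': 5500.0}]
```

Bid 2 is held back until it resolves - which is precisely the loading lag mentioned above.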

Either way, there are several solid, proven techniques for handling this - you just have to clearly identify the business requirements and design accordingly.

Good luck!

SlimsGhost
  • 2,849
  • 1
  • 10
  • 16
  • I've started using fact tables with row_effective_date and row_expiration_date just like a dimension allowing the fact to change over time. – Ryan Griffith Dec 28 '15 at 18:04
  • @RyanGriffith - Yes, exactly; that is the design I've used as well. Fact tables (and the events/history they represent) often end up behaving a lot like Type II dimensions, but it works! – SlimsGhost Dec 29 '15 at 16:07