
We are working on a data warehouse for a bank and have pretty much followed the standard Kimball model: staging tables, a star schema and an ETL process to pull the data through.

Kimball talks about using the staging area for import, cleaning, processing and everything else until you are ready to put the data into the star schema. In practice this typically means uploading data from the sources into a set of tables with little or no modification, then optionally taking it through intermediate tables until it is ready to go into the star schema. That's a lot of work for a single area; there is no single responsibility here.

Previous systems I have worked on have made a distinction between the different sets of tables, to the extent of having:

  • Upload tables: raw source system data, unmodified
  • Staging tables: intermediate processing, typed and cleansed
  • Warehouse tables

You can stick these in separate schemas and then apply differing policies for archive/backup/security etc. One of the other guys has worked on a warehouse with a StagingInput and a StagingOutput; a similar story. The team as a whole has a lot of experience, both data warehouse and otherwise.
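For illustration, here is a minimal sketch of the sort of separation I mean. It is hypothetical: the table and column names are invented, and SQLite's ATTACH stands in for the separate schemas/databases you would use on a real warehouse platform.

```python
# Sketch only: three physical layers modelled as separate SQLite files,
# standing in for separate schemas/databases on a real platform.
import sqlite3

conn = sqlite3.connect("warehouse.db")                   # warehouse layer
conn.execute("ATTACH DATABASE 'upload.db'  AS upload")   # raw, unmodified
conn.execute("ATTACH DATABASE 'staging.db' AS staging")  # typed, cleansed

# Upload layer: everything lands as text, no constraints, mirrors the source.
conn.execute("""
    CREATE TABLE IF NOT EXISTS upload.customer_feed (
        customer_id TEXT, name TEXT, dob TEXT, loaded_at TEXT)""")

# Staging layer: typed, cleansed, constrained.
conn.execute("""
    CREATE TABLE IF NOT EXISTS staging.customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        dob         DATE)""")

# Because each layer is its own schema/database, each can carry its own
# archive/backup/security policy without affecting the others.
```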

However, despite all this, looking through Kimball and the web there seems to be absolutely nothing in writing about giving any kind of structure to the staging database. One would be forgiven for believing that Mr Kimball would have us all work with staging as this big deep dark unstructured pool of data.

Whilst it is of course fairly obvious how to go about adding more structure to the staging area, it seems very odd that nothing appears to have been written about it.

So, what is everyone else out there doing? Is staging just a big unstructured mess, or do folks have some interesting designs for it?

NeedHack

7 Answers


Just a note: there is a book called "The Data Warehouse ETL Toolkit" by Ralph Kimball and Joe Caserta, so Mr. Kimball did put some effort into this. :)

Damir Sudarevic

I have experienced the same problem. We have a large HR data warehouse and I am pulling data from systems all over the enterprise. I've got a nice collection of fact and dimension tables, but the staging area is a mess. I don't know of any standards for designing it. I would follow the same path you are on and come up with a standard set of names to keep things in order. Your naming suggestion is pretty good; I'd keep working with that.

Christian Loris
  • Curious, an area that nobody seems to be interested in, yet one which affects every BI project of any scale. I guess that the Upload and Staging distinction will give us some structure at least. – NeedHack May 27 '09 at 10:06

We are working on a large insurance DWH project at the moment. It's slightly complicated, but each source system's tables are put into a separate schema in a STAGING database; then ETL moves/cleanses/conforms (MDM) the data from the staging database into a STAGINGCLEAN database, and further ETL moves the data into a Kimball DWH.

We find the separation of the Staging and StagingClean databases very helpful in diagnosing issues, particularly around data quality, as we have the dirty staged data as well as the cleaned version before it is transformed into the DWH proper.
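As a minimal sketch (a hypothetical policy feed, with SQLite standing in for our real platform), the Staging-to-StagingClean hop looks roughly like this; the point is that the dirty rows stay behind for diagnosis:

```python
# Sketch of the STAGING -> STAGINGCLEAN hop. Dirty rows are left in
# place, so both versions exist when diagnosing data quality issues.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_policy      (policy_no TEXT, premium TEXT);
    CREATE TABLE stagingclean_policy (policy_no TEXT NOT NULL,
                                      premium   REAL NOT NULL);
""")

def cleanse(row):
    """Conform one raw row; return None to reject it."""
    policy_no, premium = row
    if not policy_no:
        return None
    try:
        return (policy_no.strip().upper(), float(premium))
    except (TypeError, ValueError):
        return None

raw = conn.execute("SELECT policy_no, premium FROM staging_policy")
cleaned = [c for r in raw if (c := cleanse(r)) is not None]
conn.executemany("INSERT INTO stagingclean_policy VALUES (?, ?)", cleaned)
# staging_policy is untouched: when a load looks wrong you can diff the
# dirty rows against the cleaned ones before they reach the DWH.
```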

Marcus D
  • We do this with regular imports to the production database (not a data warehouse) as well. Can't tell you how much easier it is to see the million records uncleaned when trying to show that the problem is their data, not our process. – HLGEM May 21 '13 at 20:41

There can be sub-areas in staging, called staging1 and staging2, for example.

Staging1 can be a direct pull from the data sources with no transformation, and it only keeps the latest data.

Staging2 keeps data that has been transformed and is ready to go into the warehouse, and it keeps all historical data.
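A minimal sketch of this split (the table names are just examples, with SQLite standing in for the warehouse platform): staging1 is truncated and reloaded each run, while staging2 is append-only with a load timestamp.

```python
# staging1: latest snapshot only (truncate and reload).
# staging2: transformed, append-only, keeps all history.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging1_account (account_id TEXT, balance TEXT);
    CREATE TABLE staging2_account (account_id TEXT, balance REAL,
                                   loaded_at  TEXT);
""")

def load(rows):
    conn.execute("DELETE FROM staging1_account")        # keep latest only
    conn.executemany("INSERT INTO staging1_account VALUES (?, ?)", rows)
    loaded_at = datetime.now(timezone.utc).isoformat()
    transformed = [(acct, float(bal), loaded_at) for acct, bal in rows]
    conn.executemany("INSERT INTO staging2_account VALUES (?, ?, ?)",
                     transformed)                       # history accumulates

load([("A-1", "100.50"), ("A-2", "7.25")])
load([("A-1", "99.00")])  # staging1 now has 1 row; staging2 has 3
```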

Ken Yao
  • Thanks Ken, yes, this is similar to designs I have worked with in the past. What I find strange is that there is nothing published about it. – NeedHack Jul 29 '09 at 14:09
  • I personally wouldn't recommend tacking a number onto the end of a table name to denote a difference in the database. If I inherited that schema, my first thought would be something like, 'oh, these must be abandoned tables that the team never deleted'. – yurisich Jan 13 '12 at 12:08

What a great question.

In the past we have used a _MIRR (for mirror) suffix for untransformed data landed into the database, i.e. it mirrors the source. Then we use _STG for the data transformed from source, and _DW for the star schema.

The staging tables here would be in 3NF. I think this is the key point: data is landed untransformed and kept separate from the next step, where we fully normalize the data, before then flattening it all out into our star schema for reporting.
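A tiny sketch of the suffix convention (the entity and layer names here are only examples):

```python
# One logical entity, three physical tables; the layer is in the name.
SUFFIXES = {"mirror": "_MIRR", "staging": "_STG", "warehouse": "_DW"}

def table_name(entity: str, layer: str) -> str:
    """Map a logical entity to its physical table in a given layer."""
    return f"{entity.upper()}{SUFFIXES[layer]}"

assert table_name("employee", "mirror") == "EMPLOYEE_MIRR"
assert table_name("employee", "staging") == "EMPLOYEE_STG"
assert table_name("employee", "warehouse") == "EMPLOYEE_DW"
```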

Ivan Ferić

Have a look at this post here. It gives a good overview of the responsibilities of a staging area within a DW.

Patrick Peters

Personally, I don't go looking for trouble, in Kimball, or elsewhere.

What kind of "structure" are you looking for? What kind of "structure" do you feel is needed? What problems are you seeing from the lack of "structure" you have today?

I may be leaving you with the impression that I don't think much of Kimball. Not so - I haven't read Kimball. I just don't think much of changing things for no reason beyond fitting some pattern. Change to solve some real-world problem would be fine. For instance, if you find you're backing up staging tables because a lack of structure caused the staging and warehouse tables to be treated the same, then this would be a reason to change the structure. But if that's the sort of thing you had in mind, then you should edit your question to indicate it.

John Saunders
  • The driver for us looking at this right now is that we need to be able to separate the "upload" process from the "staging" process, as feeds become available at different times. We have a requirement to upload feeds as they become available and then run the rest of the ETL. At the moment the whole staging process is mixed up in one big series of tasks. Quite aside from that, we have a requirement to write structured software to satisfy our audit requirements. – NeedHack May 14 '09 at 14:59
  • @Chris: You should clarify your question, then. I read it to be about the tables within the database, and not about structuring the process. That's a whole different question. – John Saunders May 14 '09 at 15:11
  • I don't think we can completely separate the structure of the ETL from that of the tables. Yes, my question was primarily about the table structure (it goes against the grain to have a large number of tables with no RI, constraints or anything), but the ETL structure follows on from how the tables are arranged. – NeedHack May 27 '09 at 10:10