Questions tagged [data-warehouse]

A data warehouse (DW) is a database specifically structured/designed to aid in querying, analyzing and reporting (generating reports out of) of current and historical data. DWs are central repositories of integrated data from one or more disparate sources. Basic difference between a data warehouse and a set of DB tables is how the data is organized/structured.

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data used for querying, analyzing and reporting for decision-support purposes.

Data Mart is the access layer of the data warehouse. It serves a particular department like Marketing, HR, etc. Dedicated to business function and unit specifications, data marts make the information more focused and faster to find.

Some differences between Data mart and Data warehouse:

  • Data Warehouses have multiple subject areas with more detailed information. They are integrating all sources of data. Dimensional modelling is not necessary, but it feeds dimensional models
  • Data Marts usually hold one subject area with not that detailed information - often summarized. Concentrate on integrating information from one subject area or source system. Built on dimensional models like star-schema.

There are many products readily available that provide data warehousing capability e.g. MSAccess, Essbase (Hyperion, now Oracle), Cognos, Business Objects, MicroStrategy, ...

Basics of Data Warehousing:

  • Dimensional Modelling - Consist of the identify the measurements, or facts, that are given the context by their related dimensions. The grain of the fact table describes the level of detail at which the facts are recorded.

Main steps of relational modelling:

  • Choose the business process
  • Declare the grain
  • Identify the dimensions
  • Identify the fact

Online Analytical Processing (OLAP) and it's types (ROLAP, MOLAP, HOLAP, ...): Describes basics of the DB designs and pros/cons of each way. - A variety of different design patterns are used in a data warehouse environment. Some common approaches include: Normalized (5NF); DataVault; Anchor Modelling; Dimensional (5,6); other temporal (e.g. 6NF). - SQL: Describes how a Data Warehouse can be queried. Following is a list of basic keywords that every data warehouse developer must know: - JOIN - GROUPBY

At a high level the Data Warehousing can be divided into:

  • Tools (IBM Cognos, Microsoft Business Intelligence, Oracle Business Intelligence Enterprise, dition(OBIEE), Business Objects Enterprise XI, Jaspersoft, Talend Open studio, Pentaho, Qlikview etc) readily available and how to use them. Used for small to medium sized data sets. This usually requires [at least] knowledge of tool's:
    • data model and
    • user interface
  • Building your own data warehouse for specific usecases. Used when dealing with really huge data sets (e.g. the data collected by Google, Yahoo, Facebook or a couters/performance-management-data from a large telecommunication network. This usually requires [at least] knowledge of:
    • scalability, high availability and clustering concepts.
    • data warehouse (schema, queries, data model, ...) design.
    • available databases (Oracle, Clustra, Greenplum, MySQL, DB2, ...)
    • problem domain (implicit).
    • relevant GUI/UI (SWING, JSP, ...) and business logic (J2EE, C++, ...) technologies
2778 questions
17
votes
8 answers

Is it good practice to have foreign keys in a datawarehouse (relationships)?

I think the question is clear enough. Some of the columns in my datawarehouse table could have a relationship to a primary key. But is it good practice? It is denormalized, so it should never be deleted again (data in datawarehouse). Hope question…
Lieven Cardoen
  • 25,140
  • 52
  • 153
  • 244
16
votes
7 answers

Structure within staging area of data warehouse

We are working on a datawarehouse for a bank and have pretty much followed the standard Kimball model of staging tables, a star schema and an ETL to pull the data through the process. Kimball talks about using the staging area for import, cleaning,…
NeedHack
  • 2,943
  • 3
  • 30
  • 44
15
votes
2 answers

Using a DATE field as primary key of a date dimension with MySQL

I want to handle a date dimension in a MySQL datawarehouse. (I m a newbie in the DW world) I made some searches with google and saw a lot of table structures (most of) date dimension where the Primary Key is a simple UNSIGNED INTEGER. Why don't use…
nemenems
  • 1,064
  • 2
  • 9
  • 27
15
votes
3 answers

Calendar tables in PostgreSQL 9

I am building an analytics database (I have a firm understanding of the data and the business objectives and only basic-to-moderate database skills). I have come across some references to building similar warehouses which implement the concept of…
NJ.
  • 2,155
  • 6
  • 26
  • 35
15
votes
3 answers

ETL using Python

I am working on a data warehouse and looking for an ETL solution that uses Python. I have played with SnapLogic as an ETL, but I was wondering if there were any other solutions out there. This data warehouse is just getting started. Ihave not…
emilam
  • 195
  • 2
  • 8
15
votes
1 answer

Is there a concept of slowly changing FACT in data warehouse

In data warehousing, we have the concept of slowly changing dimensions. I am just wondering why there is no jargon for 'slowly/rapidly changing FACTs' because the same Type1, Type 2 measures can be used to track changes in the FACT table.
Victor
  • 16,609
  • 71
  • 229
  • 409
15
votes
4 answers

Why primary key is (not) required on fact table in dimensional modelling?

I have heard a few references that pk is not required on fact table. I believe every single table should have a pk. How could a person understand a row in a fact table if there is no pk and 10+ foreign keys.
BI Dude
  • 1,842
  • 5
  • 37
  • 67
15
votes
7 answers

Are there any data warehouse frameworks?

I've got a lot of mysql data that I need to generate reports from. It's mostly historic data so it won't be changing much, but it weighs in at 20-30 gigabytes easily and is expected to grow. I currently have a collection of php scripts that will do…
reconbot
  • 5,138
  • 6
  • 45
  • 63
14
votes
2 answers

Database - fact table and dimension table

When reading a book for business objects, I came across the term- fact table and dimension table. Is this the standard thing for all the database that they all have fact table and dimension table or is it just for business object design? I am…
Jack_of_All_Trades
  • 10,942
  • 18
  • 58
  • 88
14
votes
6 answers

Advantages of databases like Greenplum or Vertica compared to MongoDB or Cassandra

I am currently working in a few projects with MongoDB and Apache Cassandra respectively. I am also using Solr a lot and I am handling "lots" of data with them (approx. 1-2TB). I've heard of Greenplum and Vertica the first time in the last week and I…
disco crazy
  • 31,313
  • 12
  • 80
  • 83
14
votes
2 answers

What does "incremental load" mean?

I regularly see the expression 'incremental loading' when reading articles What does is really (technically) mean? What does it implies ? Explanations using use-cases are welcome.
Spredzy
  • 4,982
  • 13
  • 53
  • 69
14
votes
5 answers

How to test (unit test) on ETL process?

I know several small companies do not do testing on ETL process, but that seems to be suboptimal from the perspective of software engineering. How do people usually do testing/unit test/functional test on ETL process?
Hello lad
  • 17,344
  • 46
  • 127
  • 200
14
votes
4 answers

Which is better, ETL or ELT?

Having spent some time working on data warehousing, I have created both ETL (extract transform load) and ELT (extract load transform) processes. It seems that ELT is a newer approach to populating data warehouses that can more easily take advantage…
Chris J
  • 9,164
  • 7
  • 40
  • 39
14
votes
6 answers

concurrent statistics gathering on Oracle 11g partiitioned table

I am developing a DWH on Oracle 11g. We have some big tables (250+ million rows), partitioned by value. Each partition is a assigned to a different feeding source, and every partition is independent from others, so they can be loaded and processed…
Francesco
  • 161
  • 1
  • 10
14
votes
4 answers

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Background: I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP. I need to extract data from it on a semi real-time basis (some-one is bound to ask what semi real-time means and the answer is as frequently as I reasonably can but…