Questions tagged [data-warehouse]

A data warehouse (DW) is a database specifically structured/designed to aid in querying, analyzing and reporting (generating reports out of) of current and historical data. DWs are central repositories of integrated data from one or more disparate sources. Basic difference between a data warehouse and a set of DB tables is how the data is organized/structured.

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data used for querying, analyzing and reporting for decision-support purposes.

Data Mart is the access layer of the data warehouse. It serves a particular department like Marketing, HR, etc. Dedicated to business function and unit specifications, data marts make the information more focused and faster to find.

Some differences between Data mart and Data warehouse:

  • Data Warehouses have multiple subject areas with more detailed information. They are integrating all sources of data. Dimensional modelling is not necessary, but it feeds dimensional models
  • Data Marts usually hold one subject area with not that detailed information - often summarized. Concentrate on integrating information from one subject area or source system. Built on dimensional models like star-schema.

There are many products readily available that provide data warehousing capability e.g. MSAccess, Essbase (Hyperion, now Oracle), Cognos, Business Objects, MicroStrategy, ...

Basics of Data Warehousing:

  • Dimensional Modelling - Consist of the identify the measurements, or facts, that are given the context by their related dimensions. The grain of the fact table describes the level of detail at which the facts are recorded.

Main steps of relational modelling:

  • Choose the business process
  • Declare the grain
  • Identify the dimensions
  • Identify the fact

Online Analytical Processing (OLAP) and it's types (ROLAP, MOLAP, HOLAP, ...): Describes basics of the DB designs and pros/cons of each way. - A variety of different design patterns are used in a data warehouse environment. Some common approaches include: Normalized (5NF); DataVault; Anchor Modelling; Dimensional (5,6); other temporal (e.g. 6NF). - SQL: Describes how a Data Warehouse can be queried. Following is a list of basic keywords that every data warehouse developer must know: - JOIN - GROUPBY

At a high level the Data Warehousing can be divided into:

  • Tools (IBM Cognos, Microsoft Business Intelligence, Oracle Business Intelligence Enterprise, dition(OBIEE), Business Objects Enterprise XI, Jaspersoft, Talend Open studio, Pentaho, Qlikview etc) readily available and how to use them. Used for small to medium sized data sets. This usually requires [at least] knowledge of tool's:
    • data model and
    • user interface
  • Building your own data warehouse for specific usecases. Used when dealing with really huge data sets (e.g. the data collected by Google, Yahoo, Facebook or a couters/performance-management-data from a large telecommunication network. This usually requires [at least] knowledge of:
    • scalability, high availability and clustering concepts.
    • data warehouse (schema, queries, data model, ...) design.
    • available databases (Oracle, Clustra, Greenplum, MySQL, DB2, ...)
    • problem domain (implicit).
    • relevant GUI/UI (SWING, JSP, ...) and business logic (J2EE, C++, ...) technologies
2778 questions
27
votes
12 answers

What is a data warehouse?

I was asked by a customer what the term "data warehouse" really means. I thought about ETL, details of the data model, differences to NoSQL, Clouds, 'normal' DBMS, MDM (Master Data Management) etc. but wasn't able to describe the term in a few…
FloE
  • 1,166
  • 1
  • 10
  • 19
27
votes
13 answers

Database design: one huge table or separate tables?

Currently I am designing a database for use in our company. We are using SQL Server 2008. The database will hold data gathered from several customers. The goal of the database is to acquire aggregate benchmark numbers over several…
littlegreen
  • 7,290
  • 9
  • 45
  • 51
26
votes
1 answer

What's the difference between additive, semi-additive, and non-additive measures

I have searched in the net what's the difference between additive, semi-additive, and non-additive measures in a data warehouse. I have found some results but I have difficulty understanding the differences because they aren't an example. Could you…
user2460074
  • 1,252
  • 3
  • 11
  • 28
24
votes
7 answers

Efficiently storing 7.300.000.000 rows

How would you tackle the following storage and retrieval problem? Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row: id (unique row identifier) entity_id (takes on values between 1 and 2.000.000…
knorv
  • 49,059
  • 74
  • 210
  • 294
23
votes
2 answers

Apache Spark + Delta Lake concepts

I have many doubts related to Spark + Delta. 1) Databricks propose 3 layers (bronze, silver, gold), but in which layer is recommendable to use for Machine Learning and why? I suppose they propose to have the data clean and ready in the gold…
23
votes
3 answers

Design of a data warehouse with more than one fact tables

I'm new to data warehousing. First, I want to precise than my copy of The Data Warehouse Toolkit is on it's way to my mailbox (snail mail :P). But I'm already studying all this stuff with what I find on the net. What I don't find on the net,…
user327961
  • 2,440
  • 3
  • 22
  • 20
23
votes
4 answers

Benefits of using Staging Database while designing Data Warehouse

I am in process of designing a Data Warehouse Architecture. While exploring various options to Extract data from Production and putting into Data Warehouse, I came across many articles which mainly suggested following two approaches - Production…
Prateek Singh
  • 863
  • 1
  • 8
  • 28
22
votes
4 answers

What does "bulk load" mean?

Jumping from article to article, I can see everywhere the expression "bulk loading". What does it really (technically) mean? What does it imply? Explanation based on use-cases is welcome.
Spredzy
  • 4,982
  • 13
  • 53
  • 69
22
votes
2 answers

What is Multi Dimension OLAP CUBE and give example cube with more than 3 dimensions

As I am new to SSAS, have been reading an article on Multi-Dimension OLAP Cube and struggling to understand Cube concepts, It has been said that Although the term "cube" suggests three dimensions, a cube can have up to 64 dimensions. Could you…
rmdussa
  • 1,549
  • 10
  • 27
  • 50
21
votes
1 answer

MERGE - conditional "WHEN MATCHED THEN UPDATE"

The highlights in the image below shows the logic I want to implement. I realize the syntax is incorrect. Is there a way to conditionally update a record in a MERGE statement only if it the value of one of its columns in the target table is NULL,…
sion_corn
  • 3,043
  • 8
  • 39
  • 65
19
votes
6 answers

How to extract data from Google Analytics and build a data warehouse (webhouse) from it?

I have click stream data such as referring URL, top landing pages, top exit pages and metrics such as page views, number of visits, bounces all in Google Analytics. There is no database yet where all this information might be stored. I am required…
nkaur301
  • 191
  • 1
  • 1
  • 5
19
votes
4 answers

Time and date dimension in data warehouse

I'm building a data warehouse. Each fact has it's timestamp. I need to create reports by day, month, quarter but by hours too. Looking at the examples I see that dates tend to be saved in dimension tables. (source: etl-tools.info) But I think,…
Piotr Gwiazda
  • 12,080
  • 13
  • 60
  • 91
18
votes
8 answers

Datamart vs. reporting Cube, what are the differences?

The terms are used all over the place, and I don't know of crisp definitions. I'm pretty sure I know what a data mart is. And I've created reporting cubes with tools like Business Objects and Cognos. I've also had folks tell me that a datamart is…
S.Lott
  • 384,516
  • 81
  • 508
  • 779
18
votes
5 answers

open source business intelligence solutions

which open source business intelligence solution would you recommend? All I need is to build some cubes and let the end user play with dimensions, filter data, sort, etc, and once it's done being able to export it to excel... I'd like the solution…
opensas
  • 60,462
  • 79
  • 252
  • 386
17
votes
7 answers

Hadoop Vs Data Lake

I heard a new term Data Lake. I googled and got that A data lake is a large-scale storage repository and processing engine. A data lake provides "massive storage for any kind of data, enormous processing power and the ability to handle virtually…
Kishore
  • 5,761
  • 5
  • 28
  • 53