Questions tagged [data-warehouse]

A data warehouse (DW) is a database specifically structured/designed to aid in querying, analyzing and reporting (generating reports out of) of current and historical data. DWs are central repositories of integrated data from one or more disparate sources. Basic difference between a data warehouse and a set of DB tables is how the data is organized/structured.

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data used for querying, analyzing and reporting for decision-support purposes.

Data Mart is the access layer of the data warehouse. It serves a particular department like Marketing, HR, etc. Dedicated to business function and unit specifications, data marts make the information more focused and faster to find.

Some differences between Data mart and Data warehouse:

  • Data Warehouses have multiple subject areas with more detailed information. They are integrating all sources of data. Dimensional modelling is not necessary, but it feeds dimensional models
  • Data Marts usually hold one subject area with not that detailed information - often summarized. Concentrate on integrating information from one subject area or source system. Built on dimensional models like star-schema.

There are many products readily available that provide data warehousing capability e.g. MSAccess, Essbase (Hyperion, now Oracle), Cognos, Business Objects, MicroStrategy, ...

Basics of Data Warehousing:

  • Dimensional Modelling - Consist of the identify the measurements, or facts, that are given the context by their related dimensions. The grain of the fact table describes the level of detail at which the facts are recorded.

Main steps of relational modelling:

  • Choose the business process
  • Declare the grain
  • Identify the dimensions
  • Identify the fact

Online Analytical Processing (OLAP) and it's types (ROLAP, MOLAP, HOLAP, ...): Describes basics of the DB designs and pros/cons of each way. - A variety of different design patterns are used in a data warehouse environment. Some common approaches include: Normalized (5NF); DataVault; Anchor Modelling; Dimensional (5,6); other temporal (e.g. 6NF). - SQL: Describes how a Data Warehouse can be queried. Following is a list of basic keywords that every data warehouse developer must know: - JOIN - GROUPBY

At a high level the Data Warehousing can be divided into:

  • Tools (IBM Cognos, Microsoft Business Intelligence, Oracle Business Intelligence Enterprise, dition(OBIEE), Business Objects Enterprise XI, Jaspersoft, Talend Open studio, Pentaho, Qlikview etc) readily available and how to use them. Used for small to medium sized data sets. This usually requires [at least] knowledge of tool's:
    • data model and
    • user interface
  • Building your own data warehouse for specific usecases. Used when dealing with really huge data sets (e.g. the data collected by Google, Yahoo, Facebook or a couters/performance-management-data from a large telecommunication network. This usually requires [at least] knowledge of:
    • scalability, high availability and clustering concepts.
    • data warehouse (schema, queries, data model, ...) design.
    • available databases (Oracle, Clustra, Greenplum, MySQL, DB2, ...)
    • problem domain (implicit).
    • relevant GUI/UI (SWING, JSP, ...) and business logic (J2EE, C++, ...) technologies
2778 questions
13
votes
2 answers

Star schema, normalized dimensions, denormalized hierarchy level keys

Given the following star schema tables. fact, two dimensions, two measures. # geog_abb time_date amount value #1: AL 2013-03-26 55.57 9113.3898 #2: CO 2011-06-28 19.25 9846.6468 #3: MI 2012-05-15 94.87 4762.5398 #4: …
13
votes
4 answers

How to calculate Azure SQL Data Warehouse DWU?

I am analyzing Azure SQL DW and I came across the term DWU (Data warehouse units). The link on Azure site only mentions a crude definition of DWU. I want to understand how DWU is calculated and how should I scale my system accordingly. I have also…
ViSu
  • 462
  • 2
  • 4
  • 17
13
votes
2 answers

Is star schema still necessary for a big-data-warehouse?

I am designing a new hadoop-based data warehouse using hive and I was wondering whether the classic star/snowflake schemas were still a "standard" in this context. Big Data systems embrace redundancy so that fully normalized schemas have usually…
Nicola Ferraro
  • 4,051
  • 5
  • 28
  • 60
13
votes
2 answers

SSIS - OLE DB Destination - Table or Views load vs. Fast-load

From what I've read: Table or Views data access mode commits each row at a time as a transaction. Thus, making a package transferring 5 million rows take long (30+ minutes). Fast-load data access mode allows one to specify the batch rows and the…
Ramon Gonzalez
  • 241
  • 1
  • 4
  • 11
13
votes
4 answers

Calendar table for Data Warehouse

For my data warehouse, I am creating a calendar table as follows: SET NOCOUNT ON DROP Table dbo.Calendar GO Create Table dbo.Calendar ( CalendarId Integer NOT NULL, DateValue Date NOT NULL, DayNumberOfWeek …
Raj More
  • 47,048
  • 33
  • 131
  • 198
12
votes
1 answer

How to create history fact table?

I have some entities in my Data Warehouse: Person - with attributes personId, dateFrom, dateTo, and others those can be changed, e.g. last name, birth date and so on - slowly changing dimension Document - documentId, number, type Address -…
Argnist
  • 535
  • 1
  • 5
  • 18
12
votes
1 answer

Generating star schema in hive

I am from SQL Datawarehouse world where from a flat feed I generate dimension and fact tables. In general data warehouse projects we divide feed into fact and dimension. Ex: I am completely new to Hadoop and I came to know that I can build data…
Zerotoinfinity
  • 6,290
  • 32
  • 130
  • 206
12
votes
2 answers

What should I have in mind when building OLAP solution from scratch?

I'm working for a company running a software product based on a MS SQL database server, and through the years I have developed 20-30 quite advanced reports in PHP, taking data directly from the database. This has been very successful, and people are…
12
votes
4 answers

Database architecture for millions of new rows per day

I need to implement a custom-developed web analytics service for large number of websites. The key entities here are: Website Visitor Each unique visitor will have have a single row in the database with information like landing page, time of day,…
Paras Chopra
  • 4,029
  • 4
  • 21
  • 19
11
votes
4 answers

Extrapolate daily historical values from a table that only records when a value changes (Postgresql 9.3)

I have a table that records a row for each time a score for a location has changed. score_history: id int PK (uuid auto incrementing int) happened_at timestamp (when the score changed) location_id int FK (the location that the value is for) score…
Nick
  • 8,483
  • 10
  • 46
  • 65
11
votes
3 answers

date/time dimension

I am designing a data warehouse and I have a sticky issue with time. The grain I need is hourly (to calculate aggregate counts of events per hour) and I also have to accommodate a shift pattern that does not conveniently fit inside a 24 hour period…
dfoster99
  • 131
  • 1
  • 5
11
votes
4 answers

How Do I aggregate Data By Day and Still Respect Timezone?

We are currently using a summary table that aggregates information for our users on an hourly basis in UTC time. The problem we are having is that this table is becoming too large and slowing our system down immensely. We have done all the tuning…
Russ Bradberry
  • 10,705
  • 17
  • 69
  • 85
11
votes
4 answers

Database warehouse design: fact tables and dimension tables

I am building a poor man's data warehouse using a RDBMS. I have identified the key 'attributes' to be recorded as: sex (true/false) demographic classification (A, B, C etc) place of birth date of birth weight (recorded daily): The fact that is…
morpheous
  • 16,270
  • 32
  • 89
  • 120
11
votes
4 answers

Is it possible to partially refresh a materialized view in Oracle?

I have a very complex Oracle view based on other materialized views, regular views as well as some tables (I can't "fast refresh" it). Most of the time, existing records in this view are based on a date and are "stable", with new record sets having…
Galghamon
  • 2,012
  • 18
  • 27
11
votes
2 answers

OLAP - Calculate run-off triangles, sample data and cube included (PostgreSQL/Mondrian)

Reality description: We do have a list of projects. In every project there is a lot of accounts. You can do a lot of actions on every account. I do have the following dimensions and fact table defined (simplified): Dimensions and attributes: …
Tomas Greif
  • 21,685
  • 23
  • 106
  • 155