Questions tagged [data-warehouse]

A data warehouse (DW) is a database specifically structured/designed to aid in querying, analyzing and reporting (generating reports out of) of current and historical data. DWs are central repositories of integrated data from one or more disparate sources. Basic difference between a data warehouse and a set of DB tables is how the data is organized/structured.

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data used for querying, analyzing and reporting for decision-support purposes.

Data Mart is the access layer of the data warehouse. It serves a particular department like Marketing, HR, etc. Dedicated to business function and unit specifications, data marts make the information more focused and faster to find.

Some differences between Data mart and Data warehouse:

  • Data Warehouses have multiple subject areas with more detailed information. They are integrating all sources of data. Dimensional modelling is not necessary, but it feeds dimensional models
  • Data Marts usually hold one subject area with not that detailed information - often summarized. Concentrate on integrating information from one subject area or source system. Built on dimensional models like star-schema.

There are many products readily available that provide data warehousing capability e.g. MSAccess, Essbase (Hyperion, now Oracle), Cognos, Business Objects, MicroStrategy, ...

Basics of Data Warehousing:

  • Dimensional Modelling - Consist of the identify the measurements, or facts, that are given the context by their related dimensions. The grain of the fact table describes the level of detail at which the facts are recorded.

Main steps of relational modelling:

  • Choose the business process
  • Declare the grain
  • Identify the dimensions
  • Identify the fact

Online Analytical Processing (OLAP) and it's types (ROLAP, MOLAP, HOLAP, ...): Describes basics of the DB designs and pros/cons of each way. - A variety of different design patterns are used in a data warehouse environment. Some common approaches include: Normalized (5NF); DataVault; Anchor Modelling; Dimensional (5,6); other temporal (e.g. 6NF). - SQL: Describes how a Data Warehouse can be queried. Following is a list of basic keywords that every data warehouse developer must know: - JOIN - GROUPBY

At a high level the Data Warehousing can be divided into:

  • Tools (IBM Cognos, Microsoft Business Intelligence, Oracle Business Intelligence Enterprise, dition(OBIEE), Business Objects Enterprise XI, Jaspersoft, Talend Open studio, Pentaho, Qlikview etc) readily available and how to use them. Used for small to medium sized data sets. This usually requires [at least] knowledge of tool's:
    • data model and
    • user interface
  • Building your own data warehouse for specific usecases. Used when dealing with really huge data sets (e.g. the data collected by Google, Yahoo, Facebook or a couters/performance-management-data from a large telecommunication network. This usually requires [at least] knowledge of:
    • scalability, high availability and clustering concepts.
    • data warehouse (schema, queries, data model, ...) design.
    • available databases (Oracle, Clustra, Greenplum, MySQL, DB2, ...)
    • problem domain (implicit).
    • relevant GUI/UI (SWING, JSP, ...) and business logic (J2EE, C++, ...) technologies
2778 questions
10
votes
2 answers

How to do Real-time loading into Amazon Redshift?

We are evaluating Amazon Redshift for real time data warehousing. Data will be streamed and processed through a Java service and it should be stored in the database. We process row by row (real time) and we will only insert one row per…
10
votes
2 answers

Can Spark Replace ETL Tool

Existing process - raw structure data are copied into a staging layer of Redshift. Then use ETL tools such as Informatica, Telend to do incremental loading into Fact and Dimension table of Datamart/datawarehouse. All joins happen within database…
10
votes
3 answers

Python: interact with complex data warehouse

We've worked hard to work up a full dimensional database model of our problem, and now it's time to start coding. Our previous projects have used hand-crafted queries constructed by string manipulation. Is there any best/standard practice for…
bukzor
  • 37,539
  • 11
  • 77
  • 111
10
votes
5 answers

Where I can download sample database which can be used as data warehouse?

Where I can download sample database which can be used for data warehouse creation? It should't be sample from Microsoft (Northwind etc.). EDIT: Sorry for not clarifying my question. At my university we have class where we must create some data…
GrZeCh
  • 2,332
  • 8
  • 29
  • 38
10
votes
2 answers

Is a fact table in normalized or de-normalized form?

I did a bit R&D on the fact tables, whether they are normalized or de-normalized. I came across some findings which make me confused. According to Kimball: Dimensional models combine normalized and denormalized table structures. The dimension…
10
votes
2 answers

joining across multiple fact tables with a dimension in between

What's a good approach to data warehouse design if requested reports require summarized information about the same dimensions (and at the same granularity) but the underlying data is stored in separate fact tables? For example, a report showing…
10
votes
6 answers

Data in different resolutions

I have two tables, records are being continuously inserted to these tables from outside source. Lets say these tables are keeping statistics of user interactions. When a user is clicking a button the details of that click (the user, time of click…
nimcap
  • 10,062
  • 15
  • 61
  • 69
9
votes
3 answers

How to classify/categorize strings according to regular expression rules in Python

I am writing an ETL script in Python that gets data in CSV files, validates and sanitizes the data as well as categorizes or classifies each row according to some rules, and finally loads it into a postgresql database. The data looks like this…
ervingsb
  • 653
  • 8
  • 9
9
votes
1 answer

Difference between Transactional and Reporting Database

A friend commented like this: I'd recommend separating out your transactional data and your reporting data into a separate database And another one told me: I would implement an E-R schema in the transactional data and a star schema for the…
kuzey beytar
  • 3,076
  • 6
  • 37
  • 46
9
votes
5 answers

What difference of RDBMS and Hive?

In RDMS like MySQL thereis database, are there database also on the Hive ?as i read on the manual, hive only have table, i bit confuse about it.. and what is different concept of RDBMS and Hive ? Tks before
troya_adromeda
  • 647
  • 4
  • 15
  • 33
9
votes
3 answers

Group by vs Partition by in Oracle

I am writing a query to fetch records from a Oracle warehouse. Its a simple Select Query with joins on few tables and i have few columns to be aggregated. Hence i end up using Groupby on rest of the columns. Say I am picking some 10 columns and out…
user747858
9
votes
7 answers

alternatives to jpivot?

JPivot, even with all it's functionality, seems a bit outdated... last version at http://sourceforge.net/projects/jpivot/ is from 2008-03-17... is there any other lightweight client olap cube browser, that could take advantage of modern RIA kits,…
opensas
  • 60,462
  • 79
  • 252
  • 386
9
votes
3 answers

Subtract two dates in Microsoft SQL Server

I want to subtract 2 dates in MS SQL Server. Example: Current date Last used date '2016-03-30' '2015-02-03' Current date refers to today's date, "Last used date" is a measure. How to write a query in SQL Server? I have this but doesn't…
Shivang
  • 231
  • 1
  • 5
  • 17
9
votes
1 answer

How to best handle historical data changes in a Slowly Changing Dimension (SCD2)

The Tables: The company I work for has a Slowly Changing Dimension (employee data) which has been warehoused using the Kimball Method. The dimension table containing this data has a Primary Key (int identity employee_key, used as a surrogate in…
Matt G
  • 336
  • 3
  • 11
9
votes
4 answers

Can you recommend a good source for Teradata Best Practices?

Looks like my data warehouse project is moving to Teradata next year (from SQL Server 2005). I'm looking for resources about best practices on Teradata - from limitations of its SQL dialect to idioms and conventions for getting queries to perform…
Cade Roux
  • 88,164
  • 40
  • 182
  • 265