1

I want to understand data warehouse and data lake more in detail.

It seems to me there is different information to the topic. Inmon defines a data warehouse as

a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process

Now I understand that this is just a form of architecture and does not imply any technology. Which means that the underlying data can be any structure that could also be an S3 object storage. Moreover Waas et al. in On-Demand ELT Architecture for Right-Time BI: Extending the Vision proposed a data warehouse with a ELT process of integrating data.

When it comes to data lakes I found the following definition

scalable storage repository that holds a vast amount of raw data in its native format ("as is") until it is needed plus processing systems (engine) that can ingest data without compromising the data structure

taken from Data lake governance.

Now can a data warehouse be a more strict data lake? There has been an argument that a data warehouse must use ETL but according to Inmon the definiten does not include any restriction on data transformation? If data integration can be ELT and the there the transformation is agile e.g. it can be easily extended. A data warehouse looks very much like a data lake.

are my assumption correct or am looking at this from a skewed angle.

jmng
  • 2,479
  • 1
  • 25
  • 38
A.Dumas
  • 2,619
  • 3
  • 28
  • 51
  • 2
    One of the key purposes of a data warehouse is to convert data into special structures (i.e, star schema) optimized for querying. Data lake can't support this objective. Without such optimization, business intelligence is highly impractical (not impossible, but very unproductive). My view is that a data lake is repository of raw data - a place to store inputs for the data warehouse. It compliments data warehouse but can not replace it. – RADO Dec 11 '18 at 18:18
  • 2
    I agree with @RADO. I will add that data lakes are usually consumed by data scientists who are doing heavy analytics and a lot of experimentation. Data warehouse users tend to be ad-hoc, self-serve, or business analysts. The user base is different between the two. At our location, we consider the data lake to be part of our logical data warehouse. – Sam M Dec 11 '18 at 18:55

2 Answers2

4

A Data Warehouse and a Data Lake are independent systems that serve different purposes, can/should be complementary, and both are part of a larger data architecture. A data lake, as a concept, can be just another data source for dimensional models on a data warehouse (although the technological implementation of data lakes enables direct querying over the raw data).

You can think of a Data Lake as a "landing zone" where several systems dump data in a "complex/raw format", e.g. MP3 files from customer support calls, gzipped logs from web servers. It's meant to sit there for historical purposes and for further processing into a format that can be easily analyzed/reported over, e.g. text extraction from MP3 files.

A Data Warehouse also aggregates data from different systems, but the data is modeled into a format appropriate for reporting (like a dimensional model), its model reflects the business/domain's processes and transactions, and is usually highly curated.

Imagine the case: if you log visits to your online store using web server logs, you could keep the gzipped logs (the "transaction data") in a data lake and then process the data into a dimensional model (like this) which will be the "copy of transaction data specifically structured for query and analysis", so business users can easily explore it in Excel or some other reporting tool.

jmng
  • 2,479
  • 1
  • 25
  • 38
0

Data Warehouse was created to solve the need to do Analytical Data Processing on Enterprise Level and Structured data, it means

  • data is from throughout the organization and is usually brought to warehouse using ETL processes from various sources
  • Data in warehouse is structured and managed in format optimized for intensive analytical transformations. Most Warehouses structure data as Columnar Store and provide a SQL type interface to work with data.

Data Lake on the other hand was created to be one stop zone for all your organizations data. Data is in raw, unprocessed format straight from applications. You can also process data in lake either by moving them to warehouse or directly use them in Distributed Big Data processing systems.

So from this we see data warehouse is not data lake

  • since it does not have unstructured data
  • can only be used for compute intensive OLAP applications
sputnik
  • 38
  • 1
  • 6