2

I am trying to build a "data lake" from scratch. I understand what a data lake is for and how it works; that is covered all over the internet. But I can't find any source that explains how to actually build one from scratch. I want to understand whether:

Data warehouse + Hadoop = Data Lake

I know how to run Hadoop and how to load data into it. I want to build a sample on-premises data lake to demo to my manager. Any help is appreciated.

tk421
  • Did you succeed in building it? I'm trying to build one but I don't know where to start. I installed Hadoop and don't know how to implement the data lake. – Fatiha IMOUSSAINE Jul 01 '21 at 11:50

2 Answers

0

You need both structured and unstructured data to turn a Hadoop cluster into a data lake.

So you need an ETL pipeline that takes the unstructured data and converts it to structured data. Product reviews or something similar would serve as your unstructured data. Converting them into something usable by Hive (as an example) would give you your structured data.

I would look at https://opendata.stackexchange.com/ for data sets and search for "Hadoop ETL" for ideas on how to cleanse the data. It's up to you how you write your pipeline (Spark or MapReduce).
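As a rough illustration of the ETL step described above, here is a minimal sketch in plain Python rather than Spark or MapReduce. The sample reviews, the "N stars" rating convention, and the column names are all assumptions made up for this example; a real pipeline would read the raw files from HDFS and write its output back there.

```python
import csv
import io
import re

# Hypothetical raw reviews: free text with no fixed schema.
RAW_REVIEWS = [
    "Great blender, 5 stars! Crushes ice easily.",
    "Stopped working after a week. 1 star.",
    "Does the job, nothing special.",
]

# Assumed convention: a rating appears as "<digit> star(s)" somewhere in the text.
RATING_RE = re.compile(r"\b([1-5])\s*stars?\b", re.IGNORECASE)

COLUMNS = ["review_id", "rating", "word_count", "text"]


def to_record(review_id, text):
    """Turn one free-text review into a flat record with a fixed set of columns."""
    match = RATING_RE.search(text)
    return {
        "review_id": review_id,
        "rating": int(match.group(1)) if match else None,  # None when no rating is found
        "word_count": len(text.split()),
        "text": text.replace("\t", " ").strip(),  # keep tabs out of the delimited output
    }


def to_tsv(records):
    """Serialize records as tab-separated lines (no header row)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writerows(records)  # a missing rating is written as an empty field
    return out.getvalue()


records = [to_record(i, text) for i, text in enumerate(RAW_REVIEWS, start=1)]
tsv = to_tsv(records)
print(tsv)
```

The tab-separated output could then be copied into HDFS and exposed through a Hive table declared with tab-delimited fields, which is one way to get the "structured" side of the lake.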

tk421
  • Is a data lake just a data warehouse built the Hadoop way? – Abhinavneni Mar 04 '19 at 19:27
  • I have structured data and I have unstructured data. I have Hadoop and Hive installed, and I can put data into Hadoop. I will use Spark to query my data and some other tools to analyze it. Is that it? Is this my data lake? – Abhinavneni Mar 04 '19 at 19:29
  • Pretty much. It might not be a useful data lake (as in your queries might not have any business value) but that's it. – tk421 Mar 04 '19 at 19:43
0

You can build a data lake using AWS services. A simple way to do so is to use an AWS CloudFormation template to configure the solution, including AWS services such as Amazon S3 for unlimited data storage, Amazon Cognito for authentication, Amazon Elasticsearch for strong search capabilities, AWS Lambda for microservices, AWS Glue for data transformation, and Amazon Athena for data analytics. The following figure represents the complete architecture of building a data lake on AWS using AWS services.

See this article for details: https://medium.com/@pmahmoudzadeh/building-a-data-lake-on-aws-3f02f66a079e

meeza