What is the difference between a data lake with HDFS or S3 in AWS?

Question

I need to build a data lake on AWS, but I don't know exactly how S3 differs from HDFS. I found some answers on the Internet, but I still don't understand the real difference.

I would also like to know whether anyone can share a data lake architecture that uses HDFS and S3 on AWS.

What is your particular use-case? Do you only use one Hadoop cluster? How does the data get in & out? Feel free to edit your question to add more details, for a more detailed answer. — John Rotenstein, Jul 12 '19 at 04:01
I need to implement Informatica BDM in AWS and build a data lake — Aziza Sbai El Idrissi, Jul 12 '19 at 04:26
You might want to watch [AWS re:Invent 2018: Effective Data Lakes: Challenges and Design Patterns (ANT316) - YouTube](https://www.youtube.com/watch?v=v5lkNHib7bw) and [AWS re:Invent 2018: Intro to AWS Lake Formation - Build a secure data lake (ANT396) - YouTube](https://www.youtube.com/watch?v=nsiLMqg654s). — John Rotenstein, Jul 12 '19 at 07:25

score 4 · Accepted Answer · answered Jul 12 '19 at 04:01

HDFS is only accessible to the Hadoop cluster in which it exists. If the cluster turns off or is terminated, the data in HDFS will be gone.

Data in Amazon S3:

- Remains available at all times (it cannot be 'turned off')
- Is accessible to multiple clusters
- Is accessible to other AWS services, such as Amazon Athena (which is 'Presto as a service', so you might not even need a Hadoop cluster)
- Can be kept in different storage classes, so less-frequently accessed data can be stored at a lower cost
- Does not have storage limits (while HDFS is limited to the storage available in the Hadoop cluster)
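
To make the storage-class point concrete, here is a minimal sketch of an S3 lifecycle rule that moves older objects to cheaper classes. The prefix `raw/`, the bucket name `my-data-lake-bucket`, the 30- and 90-day thresholds, and both helper function names are assumptions for illustration only, not part of the question. `STANDARD_IA` and `GLACIER` are S3 storage class names, and `put_bucket_lifecycle_configuration` is the boto3 call that applies such a rule.

```python
import json
from typing import Any, Dict


def build_lifecycle_config(prefix: str, ia_after_days: int,
                           glacier_after_days: int) -> Dict[str, Any]:
    """Build a lifecycle configuration (hypothetical helper) that tiers
    objects under `prefix` into cheaper storage classes as they age."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must be greater than ia_after_days")
    return {
        "Rules": [
            {
                # Rule ID derived from the prefix; the naming scheme is arbitrary.
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle_config(bucket: str, config: Dict[str, Any]) -> None:
    """Send the configuration to S3. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without the SDK

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    # Assumed example values: tier everything under raw/ after 30 and 90 days.
    config = build_lifecycle_config("raw/", 30, 90)
    print(json.dumps(config, indent=2))
    # To apply it to a real bucket (hypothetical name), uncomment:
    # apply_lifecycle_config("my-data-lake-bucket", config)
```

HDFS has no direct equivalent of this kind of per-object tiering by age, so a rule like this is one reason to keep the long-lived copy of the data in S3 and treat cluster storage as temporary.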
