AWS EMR performance HDFS vs S3

Question

In Big Data, the code is pushed towards the data for execution. This makes sense, since the data is huge and the code is relatively small. On AWS EMR, the data can live either in HDFS or in S3. With S3, the data has to be pulled over the network from other nodes to the core/task nodes for execution, which might add some overhead compared to data already in HDFS.

Recently, I noticed that while an MR job was executing there was a huge latency in getting the log files into S3. Sometimes it took a couple of minutes for the log files to appear even after the job had completed.

Any thoughts on this? Does anyone have metrics for the MR job completion with the data in HDFS vs S3?

score 8 · Answer 1 · edited Dec 03 '20 at 06:14

That's problematic on a different level.

S3 has only eventual consistency. You can't necessarily see or read something immediately after your code has written it (e.g. after a close() or flush()), as the write process is delayed. I think this might be due to the allocation of free resources for the data you write. So it is not a problem of performance, but of the consistency you really want/need.
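To illustrate what "wait until the write is visible" could look like under that older consistency model, here is a minimal sketch of polling with exponential backoff. The function name `wait_until_visible` and its parameters are hypothetical and not part of any AWS SDK; the `exists` callable is an assumption standing in for whatever check you use (for example a HEAD request on the object).

```python
import time
from typing import Callable


def wait_until_visible(
    exists: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `exists` until it returns True or the attempts run out.

    `exists` is a hypothetical check supplied by the caller.
    The delay doubles after every failed attempt (0.5s, 1s, 2s, ...).
    """
    for attempt in range(attempts):
        if exists():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

A caller would pass its own check, e.g. `wait_until_visible(lambda: my_object_exists(bucket, key))`, where `my_object_exists` is a placeholder for your own lookup.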

What do I do on EMR? I start up a Hadoop cluster and put everything the job(s) need into HDFS. Reads take much longer on S3, and the eventual consistency makes it basically useless for buffering items between jobs.
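One way to stage input from S3 into HDFS before the job runs is an EMR step that invokes `s3-dist-cp`. The sketch below only builds the step definition as a plain dictionary; the helper name `build_s3distcp_step`, the bucket, and the paths are assumptions made for illustration, not something taken from this answer.

```python
from typing import Any, Dict


def build_s3distcp_step(src: str, dest: str, name: str = "Stage input into HDFS") -> Dict[str, Any]:
    """Build an EMR step definition that copies data with s3-dist-cp.

    `src` and `dest` are URIs such as s3://... or hdfs:///...
    The returned dict follows the shape expected by the EMR
    AddJobFlowSteps API (Name / ActionOnFailure / HadoopJarStep).
    """
    return {
        "Name": name,
        # Stop processing further steps if the copy fails, but keep the cluster up.
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


# Hypothetical bucket and paths, for illustration only.
step = build_s3distcp_step("s3://my-bucket/input/", "hdfs:///data/input/")
```

With boto3 installed and credentials configured, the step could be submitted with `boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])`, where `cluster_id` is your own cluster's ID. Swapping `src` and `dest` gives the reverse copy, from HDFS back to S3.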

However, S3 is great for backing up files from your HDFS or making them available to other instances or services (e.g. CloudFront).

Added:

As of 8/Dec/2020, S3 supports strong consistency across all Regions by default: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/

That's not quite true. S3 has eventual consistency *in some regions* (namely, US East). Read After Write consistency is used everywhere else. For more information: http://aws.amazon.com/s3/faqs/#What_data_consistency_model_does_Amazon_S3_employ — Mark Roberts, Dec 06 '13 at 09:36
I should point out that Mark's information is out of date. From his link: "Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES." — Mike Baranczak, Aug 23 '17 at 20:17

score 5 · Answer 2 · answered May 26 '15 at 10:17

In terms of performance, HDFS is better than S3.

HDFS is better if your requirement is long term, needs high performance, and involves running iterative machine learning algorithms.

S3 is better if your load is variable and you need high durability and persistence at lower cost.

For more information, see http://www.nithinkanil.com/2015/05/hdfs-vs-s3.html

score 3 · Answer 3 · answered Dec 21 '17 at 08:34

You must use S3 if you want to terminate the EMR cluster, because once you terminate the cluster, the HDFS data is deleted.

Anwar
