
I have several Hadoop jobs that I run on EMR. A few of those jobs need to process log files. The log files are huge, ~3 GB each, in .gz format, and are stored on S3.

Presently, I use m1.xlarge instances for processing, and it takes around 3 hours just to copy the log files from S3 to HDFS. Is the bottleneck here reading from S3 or writing to HDFS?

What I was planning is to use the new SSD-based hi1.4xlarge instead of m1.xlarge, since it has fast I/O. But will it help reduce the cost?

However, hi1.4xlarge costs much more than m1.xlarge:

m1.xlarge - 8 EC2 compute units @ $0.614 each = $4.912/hour
hi1.4xlarge - 35 EC2 compute units @ $3.10 each = $108.50/hour

That is a price increase of roughly 22X. Will I get that much of a performance improvement? Consider my Hadoop job to be heavily I/O bound.
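For reference, the ratio implied by the per-unit figures quoted above is:

```latex
\frac{35 \times \$3.10/\text{hr}}{8 \times \$0.614/\text{hr}}
  = \frac{\$108.50/\text{hr}}{\$4.912/\text{hr}}
  \approx 22.1
```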

I cannot test this myself by launching a hi1.4xlarge instance, so I am asking here on Stack Overflow. Does anyone have benchmarks comparing the two instance types? Google didn't help.

Regards.

Kartikeya Sinha

1 Answer

  1. I do not think SSD instances are a good choice, since their value is in high random I/O, while Hadoop mostly needs sequential I/O.
  2. During the copy from S3 to HDFS, S3 is almost certainly the bottleneck.
  3. To save money, I would suggest trying smaller instances to balance I/O and CPU.
  4. Are you using DistCp to copy the data from S3 to HDFS? (Just to check...)
  5. If you process the logs only once per cluster lifetime, you can read them directly from S3 and avoid the copy to HDFS altogether (see the sketch below).
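To make point 5 concrete, here is a minimal sketch of a Hadoop new-API driver whose input path points straight at S3, so the .gz logs are processed without first being copied to HDFS. The bucket name, paths, and the line-counting mapper/reducer are placeholders, not anything from the question; substitute your own log-processing logic.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3LogJob {

    // Placeholder mapper: emits ("lines", 1) for every log line.
    // Replace with your real log-parsing logic.
    public static class LineCountMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Text KEY = new Text("lines");
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(KEY, ONE);
        }
    }

    // Placeholder reducer: sums the per-mapper counts.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "process logs directly from S3");
        job.setJarByClass(S3LogJob.class);

        job.setMapperClass(LineCountMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Read the .gz logs straight from S3 instead of copying them to HDFS first.
        // "my-log-bucket" is a placeholder; on Hadoop 1.x / EMR of this era the
        // s3n:// scheme is the usual choice. The default TextInputFormat
        // decompresses gzip transparently.
        FileInputFormat.addInputPath(job, new Path("s3n://my-log-bucket/logs/"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://my-log-bucket/output/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Either way, gzip is not splittable, so each ~3 GB .gz file is handled by a single map task; reading directly from S3 simply avoids paying for the copy step first.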
David Gruzman