I have several Hadoop jobs that I run on EMR. A few of them need to process log files. The log files are huge, ~3 GB each, gzipped (.gz), and stored on S3.
Presently I use m1.xlarge instances for processing, and it takes around 3 hours just to copy a log file from S3 to HDFS. Is the bottleneck here reading from S3 or writing to HDFS?
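To try to separate the two, I put together a rough timing sketch using the Hadoop `FileSystem` API: it reads the file from S3 and throws the bytes away (S3 read throughput only), then writes the same volume of junk data to HDFS (HDFS write throughput only). The bucket and paths are placeholders, not my real ones:

    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Rough timing sketch: time the S3 read and the HDFS write separately
    // to see which side is the bottleneck. Paths are hypothetical.
    public class S3HdfsTiming {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            Path s3Path = new Path("s3n://my-bucket/logs/log-001.gz");
            Path hdfsPath = new Path("hdfs:///tmp/timing-test.gz");

            FileSystem s3 = s3Path.getFileSystem(conf);
            FileSystem hdfs = hdfsPath.getFileSystem(conf);

            byte[] buf = new byte[1 << 20]; // 1 MB buffer

            // 1. Time the S3 read alone; bytes are discarded.
            long bytes = 0;
            long t0 = System.currentTimeMillis();
            try (InputStream in = s3.open(s3Path)) {
                int n;
                while ((n = in.read(buf)) > 0) {
                    bytes += n;
                }
            }
            long readMs = System.currentTimeMillis() - t0;
            System.out.printf("S3 read:    %d MB in %d ms%n", bytes >> 20, readMs);

            // 2. Time an HDFS write of the same volume; buffer contents are
            //    junk, only write throughput matters here.
            t0 = System.currentTimeMillis();
            try (OutputStream out = hdfs.create(hdfsPath, true)) {
                for (long written = 0; written < bytes; written += buf.length) {
                    out.write(buf);
                }
            }
            long writeMs = System.currentTimeMillis() - t0;
            System.out.printf("HDFS write: %d MB in %d ms%n", bytes >> 20, writeMs);
        }
    }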
What I was planning is to switch from m1.xlarge to the new SSD-backed hi1.4xlarge, since it has much faster I/O. But will it help reduce the overall cost? hi1.4xlarge costs much more than m1.xlarge:
- m1.xlarge: 8 EC2 compute units @ $0.614 each = $4.912/hour
- hi1.4xlarge: 35 EC2 compute units @ $3.10 each = $108.50/hour
That is roughly a 22x price increase, so hi1.4xlarge would have to finish the same work more than ~22x faster just to break even on cost per job. Will I get that much performance improvement? Assume my Hadoop jobs are heavily I/O bound.
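For reference, here is the break-even arithmetic I'm using, based only on the hourly prices above:

    // Break-even check: how much faster hi1.4xlarge must be before it
    // becomes cheaper per job than m1.xlarge, using the prices above.
    public class BreakEven {
        public static void main(String[] args) {
            double m1CostPerHour  = 4.912;  // 8 units  @ $0.614, from above
            double hi1CostPerHour = 108.5;  // 35 units @ $3.10,  from above

            double requiredSpeedup = hi1CostPerHour / m1CostPerHour;
            System.out.printf("Required speedup for cost parity: %.1fx%n",
                    requiredSpeedup); // prints ~22.1x

            // Example: a 3-hour m1.xlarge job...
            double m1Hours = 3.0;
            double m1Cost = m1Hours * m1CostPerHour;
            // ...must finish in under this many hours on hi1.4xlarge:
            double hi1MaxHours = m1Cost / hi1CostPerHour;
            System.out.printf("3h m1.xlarge job ($%.2f) must run in <= %.2fh%n",
                    m1Cost, hi1MaxHours);
        }
    }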
I cannot test this myself by launching a hi1.4xlarge instance, so I am asking here on Stack Overflow: does anyone have benchmarks comparing the two instance types? Google didn't help.