I used AWS EMR (Hadoop streaming) to process 648 MB of input data in 9 text files (approx. 72 MB each, stored in S3). I expected the data to be split into either 64 MB or 128 MB blocks, but the log says the job was split into 27 map tasks (one map task is handled by one mapper, right?). Can someone explain what is going on? I also don't understand why the CPU time of the entire job differs on every run.
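
To make my confusion concrete, here is the arithmetic I had in mind as a small Python sketch. The split sizes are just my guesses, I don't know what EMR actually uses, and I'm assuming each file is split independently into chunks of at most the split size (standard FileInputFormat behaviour):

```python
import math

num_files = 9
file_size_mb = 72          # approx. size of each input file on S3
observed_map_tasks = 27    # what the EMR step log reported

# Assumption: each file is split independently into chunks of at most
# `split_size_mb`; the actual split size EMR used is what I'm unsure about.
def expected_map_tasks(split_size_mb):
    return num_files * math.ceil(file_size_mb / split_size_mb)

for split_size_mb in (64, 128):
    print(split_size_mb, "MB ->", expected_map_tasks(split_size_mb), "map tasks")
# 64 MB -> 18 map tasks, 128 MB -> 9 map tasks; neither matches the 27 I saw.

# 27 tasks works out to 3 splits per file, i.e. an effective split size
# somewhere around 72 / 3 = 24 MB, which is what puzzles me.
print("implied split size ~", file_size_mb / (observed_map_tasks / num_files), "MB")
```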

Also, it seems to me that EMR is quite different from plain Hadoop. How should I calculate the number of instances to use with EMR? And if I use S3 for data storage, I don't need to worry about the replication factor, right?
