I used AWS EMR (Hadoop streaming) to process 648 MB of input data in 9 text files (approx. 72 MB each, stored in S3). I expected the data to be split into either 64 MB or 128 MB blocks, but the log says the job was split into 27 map tasks (one map task is handled by one mapper, right?). Can someone explain what is going on? I also don't understand why the CPU time of the entire job differs on every run.
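
To make my confusion concrete, here is the arithmetic I had in mind as a small Python sketch. The split sizes are just my guesses, I don't know what EMR actually uses, and I'm assuming each file is split independently into chunks of at most the split size (standard FileInputFormat behaviour):

```python
import math

num_files = 9
file_size_mb = 72          # approx. size of each input file on S3
observed_map_tasks = 27    # what the EMR step log reported

# Assumption: each file is split independently into chunks of at most
# `split_size_mb`; the actual split size EMR used is what I'm unsure about.
def expected_map_tasks(split_size_mb):
    return num_files * math.ceil(file_size_mb / split_size_mb)

for split_size_mb in (64, 128):
    print(split_size_mb, "MB ->", expected_map_tasks(split_size_mb), "map tasks")
# 64 MB -> 18 map tasks, 128 MB -> 9 map tasks; neither matches the 27 I saw.

# 27 tasks works out to 3 splits per file, i.e. an effective split size
# somewhere around 72 / 3 = 24 MB, which is what puzzles me.
print("implied split size ~", file_size_mb / (observed_map_tasks / num_files), "MB")
```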

Also, it seems to me that EMR is quite different from plain Hadoop. How should I calculate the number of instances to use with EMR? And if I use S3 for data storage, I don't need to worry about the replication factor, right?
