
The problem I'm encountering is this: having already put my 50 MB file input.txt into HDFS, I'm running

python ./test.py hdfs:///user/myself/input.txt -r hadoop --hadoop-bin /usr/bin/hadoop 

It seems that mrjob spends a lot of time copying files into HDFS (again?):

Copying local files into hdfs:///user/myself/tmp/mrjob/test.myself.20150927.104821.148929/files/

Is this expected? Shouldn't it use input.txt directly from HDFS?

(Using Hadoop version 2.6.0)

Nikos

1 Answer

Look at the contents of hdfs:///user/myself/tmp/mrjob/test.myself.20150927.104821.148929/files/ and you will see that input.txt isn't the file that's being copied into HDFS.

What's being copied is mrjob's entire Python package, so that it can be unpacked on each of your nodes. (mrjob assumes that it is not already installed on the nodes of your cluster.)
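You can verify this yourself by listing the job's scratch directory. A minimal sketch, reusing the tmp path from your log output (substitute the path your own run prints):

```shell
# Hypothetical path taken from the question's log line; each run gets its own
# timestamped directory, so copy the one mrjob actually printed for you.
JOB_TMP="hdfs:///user/myself/tmp/mrjob/test.myself.20150927.104821.148929"

# List what mrjob uploaded. You should see mrjob's own package archive and
# your job script -- not input.txt, which is read in place from HDFS.
hadoop fs -ls "$JOB_TMP/files/"
```

If the upload time bothers you, note that it is the mrjob library itself being shipped once per job, not your 50 MB input being re-copied.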

vy32