
The problem I'm encountering is this: having already put my 50 MB file input.txt into HDFS, I'm running

python ./test.py hdfs:///user/myself/input.txt -r hadoop --hadoop-bin /usr/bin/hadoop 

It seems that mrjob spends a lot of time copying files into HDFS (again?):

Copying local files into hdfs:///user/myself/tmp/mrjob/test.myself.20150927.104821.148929/files/

Is this expected? Shouldn't it use input.txt directly from HDFS?

(Using Hadoop version 2.6.0)

Nikos

1 Answer

Look at the contents of hdfs:///user/myself/tmp/mrjob/test.myself.20150927.104821.148929/files/ and you will see that input.txt isn't the file that's being copied into HDFS.

What's being copied is mrjob's entire Python package, so that it can be unpacked on each of your nodes. (mrjob assumes that it is not already installed on the nodes of your cluster.)
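You can verify this yourself by listing the job's scratch directory. A minimal sketch, reusing the tmp path from your log output (substitute the path your own run prints):

```shell
# Hypothetical path taken from the question's log line; each run gets its own
# timestamped directory, so copy the one mrjob actually printed for you.
JOB_TMP="hdfs:///user/myself/tmp/mrjob/test.myself.20150927.104821.148929"

# List what mrjob uploaded. You should see mrjob's own package archive and
# your job script -- not input.txt, which is read in place from HDFS.
hadoop fs -ls "$JOB_TMP/files/"
```

If the upload time bothers you, note that it is the mrjob library itself being shipped once per job, not your 50 MB input being re-copied.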

vy32