I have a large amount of data stored in an HDFS system (or, alternatively, in Amazon S3).
I want to process it using mrjob.
Unfortunately, when I run mrjob and give it the HDFS file name (or the name of the containing directory), I get an error.
For example, the data is stored in the directory hdfs://user/hadoop/in1/. For testing, the only file there is hdfs://user/hadoop/in1/BCES_FY2014_clean.csv, but in production there will be multiple files in that directory.
The file is present:
$ hdfs dfs -ls /user/hadoop/in1/
Found 1 items
-rw-r--r-- 1 hadoop hadoop 1771685 2015-12-07 03:05 /user/hadoop/in1/BCES_FY2014_clean.csv
$
But when I try to run it with mrjob, I get this error:
$ python mrjob_salary_max.py -r hadoop hdfs://user/hadoop/in1/BCES_FY2014_clean.csv
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
STDERR: -ls: java.net.UnknownHostException: user
STDERR: Usage: hadoop fs [generic options] -ls [-d] [-h] [-R] [<path> ...]
Traceback (most recent call last):
  File "mrjob_salary_max.py", line 26, in <module>
    salarymax.run()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/job.py", line 461, in run
    mr_job.execute()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/job.py", line 479, in execute
    super(MRJob, self).execute()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/launch.py", line 153, in execute
    self.run_job()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/launch.py", line 216, in run_job
    runner.run()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/runner.py", line 470, in run
    self._run()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/hadoop.py", line 233, in _run
    self._check_input_exists()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/hadoop.py", line 249, in _check_input_exists
    'Input path %s does not exist!' % (path,))
AssertionError: Input path hdfs://user/hadoop/in1/BCES_FY2014_clean.csv does not exist!
$
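One thing I noticed about the "UnknownHostException: user" line: Hadoop seems to be treating the "user" segment of my URI as a hostname rather than as part of the path. A quick check with Python's standard-library URL parser shows the same split (this is just a diagnostic sketch, not part of my job code):

```python
from urllib.parse import urlparse

uri = "hdfs://user/hadoop/in1/BCES_FY2014_clean.csv"
parts = urlparse(uri)

# In a scheme://authority/path URI, the segment immediately after "//"
# is the authority (host), not the first path component.
print(parts.netloc)  # the host component: 'user'
print(parts.path)    # the path component: '/hadoop/in1/BCES_FY2014_clean.csv'
```

So the path I am passing may be getting interpreted as a file on a host named "user", which would explain the -ls failure, though I am not sure what the correct way to spell the URI for mrjob is.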
Everything works when mrjob reads from the local file system, but that won't scale.