
I'm trying to run the mrjob example from the book Hadoop with Python on my laptop, in pseudo-distributed mode.

(the file salaries.csv can be found here)

I can start the namenode and the datanode:

start-dfs.sh

returns:

Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-namenode-me-Notebook-PC.out
localhost: starting datanode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-datanode-me-Notebook-PC.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-secondarynamenode-me-Notebook-PC.out
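
As a quick sanity check, jps (which ships with the JDK) should now list NameNode, DataNode and SecondaryNameNode:

jps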

I also have no problem creating the input directory structure and copying salaries.csv onto HDFS:

hdfs dfs -mkdir /user/
hdfs dfs -mkdir /user/me/
hdfs dfs -mkdir /user/me/input/
hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/me/input/
hdfs dfs -ls /user/me/input/

returns:

Found 1 items
-rw-r--r--   3 me supergroup    1771685 2016-12-24 15:57 /user/me/input/salaries.csv
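
As an aside, the three -mkdir calls above can be collapsed into one, since hdfs dfs -mkdir accepts -p (at least in Hadoop 2.x) to create parent directories as needed:

hdfs dfs -mkdir -p /user/me/input/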

I also make top_salaries.py executable (probably not strictly needed, since the streaming command in the error below shows mrjob invoking the script as python top_salaries.py):

sudo chmod a+x /home/me/Desktop/work/cv/hadoop/top_salaries.py

Launching top_salaries.py in local mode also works:

python2 top_salaries.py -r local salaries.csv > answer.csv

returns:

No configs found; falling back on auto-configuration
Creating temp directory /tmp/top_salaries.me.20161224.195052.762894
Running step 1 of 1...
Counters: 1
    warn
        missing gross=3223
Counters: 1
    warn
        missing gross=3223
Streaming final output from /tmp/top_salaries.me.20161224.195052.762894/output...
Removing temp directory /tmp/top_salaries.me.20161224.195052.762894...
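
Each line of answer.csv is a tab-separated key/value pair (mrjob's default output protocol is JSON), so a quick look at the file is enough to sanity-check the run; the exact rows depend on the data:

head answer.csv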

However, running this job on Hadoop (putting things together):

python2 top_salaries.py -r hadoop hdfs:///user/me/input/salaries.csv

returns:

No configs found; falling back on auto-configuration
Looking for hadoop binary in $PATH...
Found hadoop binary: /home/me/hadoop-2.7.3/bin/hadoop
Using Hadoop version 2.7.3
Looking for Hadoop streaming jar in /home/me/hadoop-2.7.3...
Found Hadoop streaming jar: /home/me/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar
Creating temp directory /tmp/top_salaries.me.20161224.195201.967990
Copying local files to hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/...
Running step 1 of 1...
  session.id is deprecated. Instead, use dfs.metrics.session-id
  Initializing JVM Metrics with processName=JobTracker, sessionId=
  Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
  Cleaning up the staging area file:/tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001
  Error launching job , bad input path : File does not exist: /tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001/files/mrjob.zip#mrjob.zip
  Streaming Command Failed!
Attempting to fetch counters from logs...
Can't fetch history log; missing job ID
No counters found
Scanning logs for probable cause of failure...
Can't fetch history log; missing job ID
Can't fetch task logs; missing application ID
Step 1 of 1 failed: Command '['/home/me/hadoop-2.7.3/bin/hadoop', 'jar', '/home/me/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar', '-files', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/mrjob.zip#mrjob.zip,hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/top_salaries.py#top_salaries.py', '-input', 'hdfs:///user/me/input/salaries.csv', '-output', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/output', '-mapper', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --mapper', '-combiner', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --combiner', '-reducer', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --reducer']' returned non-zero exit status 512

Edit:

this is my core-site.xml:

<configuration>
 <property>         
    <name>fs.defaultFS</name>         
    <value>hdfs://localhost:9000</value>    
 </property>
</configuration>

and this is my hdfs-site.xml:

<configuration>
    <property>
       <name>dfs.namenode.name.dir</name>
       <value>/home/me/Desktop/work/cv/hadoop/namenode</value>
    </property>
    <property>
       <name>dfs.datanode.data.dir</name>
       <value>/home/me/Desktop/work/cv/hadoop/datanode</value>
    </property>
</configuration>

(I have not edited or changed the other xml config files.)

Edit 2:

Here is the Python script (the same as at the GitHub link above):

from mrjob.job import MRJob
from mrjob.step import MRStep
import csv

cols = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')

class salarymax(MRJob):

    def mapper(self, _, line):
        # Convert each line into a dictionary
        row = dict(zip(cols, [ a.strip() for a in csv.reader([line]).next()]))

        # Yield the salary
        yield 'salary', (float(row['AnnualSalary'][1:]), line)

        # Yield the gross pay
        try:
            yield 'gross', (float(row['GrossPay'][1:]), line)
        except ValueError:
            self.increment_counter('warn', 'missing gross', 1)

    def reducer(self, key, values):
        topten = []

        # For 'salary' and 'gross' compute the top 10
        for p in values:
            topten.append(p)
            topten.sort()
            topten = topten[-10:]

        for p in topten:
            yield key, p

    combiner = reducer

if __name__ == '__main__':
    salarymax.run()
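
For illustration (this snippet is mine, not from the book), here is what the mapper does with a single made-up row shaped like salaries.csv; the [1:] slice strips the leading $ before the float conversion:

import csv

cols = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')

# A made-up row in the same shape as salaries.csv:
line = 'Doe,CLERK,A001,Police Department,10/01/2012,$56705.00,$54321.00'

# Parse the line exactly as the mapper does (Python 2, hence .next()):
row = dict(zip(cols, [a.strip() for a in csv.reader([line]).next()]))

print float(row['AnnualSalary'][1:])  # 56705.0 -> yielded as ('salary', (56705.0, line))
print float(row['GrossPay'][1:])      # 54321.0 -> yielded as ('gross', (54321.0, line))
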
  • It can't find the file /tmp/hadoop-me/mapred/staging/me118248587/.staging/job_local118248587_0001/files/mrjob.zip#mrjob.zip; check your file copy. – AdamSkywalker Dec 24 '16 at 16:08
  • The xml files do not matter. I see paths starting with /tmp/hadoop-me, hdfs:///user/me, hdfs:///user/hduser; it's a bit messy. The job can't find mrjob.zip#mrjob.zip; check how you set the input files for hadoop – AdamSkywalker Dec 24 '16 at 17:35
  • Ha! Good catch. But what do I do to fix this? I can see it is messy now, but where do I set these directories so it's more tidy? – user189035 Dec 24 '16 at 19:10
  • Use the same user for hadoop, so all the user names are the same, e.g. hdfs dfs -mkdir /user/me/ instead of hdfs dfs -mkdir /user/hduser/, and then check the new error logs – AdamSkywalker Dec 24 '16 at 19:39
  • OK, I have replaced `hdfs dfs -mkdir /user/hduser/` with `hdfs dfs -mkdir /user/me/` but somehow I still get the same errors ;( – user189035 Dec 24 '16 at 19:54
    add your python script code – AdamSkywalker Dec 24 '16 at 19:57
  • In your logs I see Copying local files to hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/.. - this means your python library moves files to that folder on hdfs. In the command startup I see '-files', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/mrjob.zip#mrjob.zip'. These are the files available to the job. When hadoop fails it says: File does not exist: /tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001/files/mrjob.zip#mrjob.zip – AdamSkywalker Dec 25 '16 at 10:26
  • The first 2 paths are the same, that's good. But hadoop is for some reason looking in the local folder /tmp/hadoop-me/mapred/.. instead of taking the hdfs input. Actually there's no more hadoop/mapred folder anywhere in the logs. There's some misconfiguration and I can't find it theoretically. There are several places to check, like app master logs and xml configs; it requires a bit of patience – AdamSkywalker Dec 25 '16 at 10:32
  • @AdamSkywalker: just before the error, I read `Cleaning up the staging area file:/tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001`...isn't the cleaning up to blame for the file not being found at the next stage? – user189035 Dec 26 '16 at 22:30
  • No, the cleaning is the result of the first error – AdamSkywalker Dec 26 '16 at 22:36
  • Couldn't [Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized] be part of the issue? – Romain Jouin Dec 29 '16 at 19:45
  • @romainjouin: I really don't know anything about hadoop/mrjob. I'm just trying to get the example to run. FWIW, `java -version` returns `openjdk version "1.8.0_111" OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14) OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)` – user189035 Dec 29 '16 at 19:55
  • I have the same issue, trying to run the examples from the book "Hadoop with Python" – Alex Marandon Jan 06 '17 at 10:05
  • @AlexMarandon: thanks for your comment. I have posted this as an issue (since you confirmed it) to the book's [git](https://github.com/MinerKasch/HadoopWithPython/issues/1). Let's see if we get more info... – user189035 Jan 13 '17 at 18:14

1 Answer


Ok. You need to edit the file core-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>

and the file hdfs-site.xml as:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/edureka/hadoop-2.7.3/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/edureka/hadoop-2.7.3/datanode</value>
    </property>
</configuration>

and you need to create a mapred-site.xml file with this content (this appears to be the key fix: without mapreduce.framework.name set to yarn, the streaming job runs in the local job runner, which matches the processName=JobTracker messages and the file:/tmp/... staging path in the logs above):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

and you need to edit yarn-site.xml to contain:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

Then do:

start-dfs.sh
start-yarn.sh
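
If YARN came up cleanly, jps should now list ResourceManager and NodeManager in addition to the HDFS daemons:

jps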

then do:

hdfs dfs -mkdir /user/
hdfs dfs -mkdir /user/me/
hdfs dfs -mkdir /user/me/input/
hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/me/input/

now doing:

sudo chmod a+x /home/me/Desktop/work/cv/hadoop/top_salaries.py
python2 top_salaries.py -r hadoop hdfs:///user/me/input/salaries.csv > answer.csv

works.
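
While the job runs, you can optionally watch it on YARN:

yarn application -list

Once it finishes, answer.csv has the same tab-separated JSON format as in the local run.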

  • Could you please explain the reason for this error and how it is fixed through these configs? Thank you – Jeena KK Feb 20 '22 at 17:59
  • Sorry, this was many years ago. I don't remember. And Hadoop may have changed so much in the meantime (I wouldn't know, I have not used it in years) that I'm not even sure the answer to your question would still apply today. – user189035 Feb 21 '22 at 18:56