
I'm trying to run the mrjob example from the book Hadoop with Python on my laptop, in pseudo-distributed mode.

(the file salaries.csv can be found here)

I can start the namenode and the datanode:

start-dfs.sh

returns:

Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-namenode-me-Notebook-PC.out
localhost: starting datanode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-datanode-me-Notebook-PC.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/me/hadoop-2.7.3/logs/hadoop-me-secondarynamenode-me-Notebook-PC.out
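
As a quick sanity check, jps (which ships with the JDK) should now list NameNode, DataNode and SecondaryNameNode:

jps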

I also have no problem creating the input directory structure and copying salaries.csv onto HDFS:

hdfs dfs -mkdir /user/
hdfs dfs -mkdir /user/me/
hdfs dfs -mkdir /user/me/input/
hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/me/input/
hdfs dfs -ls /user/me/input/

returns:

Found 1 items
-rw-r--r--   3 me supergroup    1771685 2016-12-24 15:57 /user/me/input/salaries.csv
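
As an aside, the three -mkdir calls above can be collapsed into one, since hdfs dfs -mkdir accepts -p (at least in Hadoop 2.x) to create parent directories as needed:

hdfs dfs -mkdir -p /user/me/input/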

I also make top_salaries.py executable (probably not strictly needed, since the streaming command in the error below shows mrjob invoking the script as python top_salaries.py):

sudo chmod a+x /home/me/Desktop/work/cv/hadoop/top_salaries.py

Launching top_salaries.py in local mode also works:

python2 top_salaries.py -r local salaries.csv > answer.csv

returns:

No configs found; falling back on auto-configuration
Creating temp directory /tmp/top_salaries.me.20161224.195052.762894
Running step 1 of 1...
Counters: 1
    warn
        missing gross=3223
Counters: 1
    warn
        missing gross=3223
Streaming final output from /tmp/top_salaries.me.20161224.195052.762894/output...
Removing temp directory /tmp/top_salaries.me.20161224.195052.762894...
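
Each line of answer.csv is a tab-separated key/value pair (mrjob's default output protocol is JSON), so a quick look at the file is enough to sanity-check the run; the exact rows depend on the data:

head answer.csv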

However, running this job on Hadoop (putting things together):

python2 top_salaries.py -r hadoop hdfs:///user/me/input/salaries.csv

returns:

No configs found; falling back on auto-configuration
Looking for hadoop binary in $PATH...
Found hadoop binary: /home/me/hadoop-2.7.3/bin/hadoop
Using Hadoop version 2.7.3
Looking for Hadoop streaming jar in /home/me/hadoop-2.7.3...
Found Hadoop streaming jar: /home/me/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar
Creating temp directory /tmp/top_salaries.me.20161224.195201.967990
Copying local files to hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/...
Running step 1 of 1...
  session.id is deprecated. Instead, use dfs.metrics.session-id
  Initializing JVM Metrics with processName=JobTracker, sessionId=
  Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
  Cleaning up the staging area file:/tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001
  Error launching job , bad input path : File does not exist: /tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001/files/mrjob.zip#mrjob.zip
  Streaming Command Failed!
Attempting to fetch counters from logs...
Can't fetch history log; missing job ID
No counters found
Scanning logs for probable cause of failure...
Can't fetch history log; missing job ID
Can't fetch task logs; missing application ID
Step 1 of 1 failed: Command '['/home/me/hadoop-2.7.3/bin/hadoop', 'jar', '/home/me/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar', '-files', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/mrjob.zip#mrjob.zip,hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/top_salaries.py#top_salaries.py', '-input', 'hdfs:///user/me/input/salaries.csv', '-output', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/output', '-mapper', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --mapper', '-combiner', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --combiner', '-reducer', 'sh -ex setup-wrapper.sh python top_salaries.py --step-num=0 --reducer']' returned non-zero exit status 512

Edit:

this is my core-site.xml:

<configuration>
 <property>         
    <name>fs.defaultFS</name>         
    <value>hdfs://localhost:9000</value>    
 </property>
</configuration>

and this is my hdfs-site.xml:

<configuration>
    <property>
       <name>dfs.namenode.name.dir</name>
       <value>/home/me/Desktop/work/cv/hadoop/namenode</value>
    </property>
    <property>
       <name>dfs.datanode.data.dir</name>
       <value>/home/me/Desktop/work/cv/hadoop/datanode</value>
    </property>
</configuration>

(I have not edited or changed the other xml config files.)

Edit 2:

Here is the Python script (the same as at the GitHub link above):

from mrjob.job import MRJob
from mrjob.step import MRStep
import csv

cols = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')

class salarymax(MRJob):

    def mapper(self, _, line):
        # Convert each line into a dictionary
        row = dict(zip(cols, [ a.strip() for a in csv.reader([line]).next()]))

        # Yield the salary
        yield 'salary', (float(row['AnnualSalary'][1:]), line)

        # Yield the gross pay
        try:
            yield 'gross', (float(row['GrossPay'][1:]), line)
        except ValueError:
            self.increment_counter('warn', 'missing gross', 1)

    def reducer(self, key, values):
        topten = []

        # For 'salary' and 'gross' compute the top 10
        for p in values:
            topten.append(p)
            topten.sort()
            topten = topten[-10:]

        for p in topten:
            yield key, p

    combiner = reducer

if __name__ == '__main__':
    salarymax.run()
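
For illustration (this snippet is mine, not from the book), here is what the mapper does with a single made-up row shaped like salaries.csv; the [1:] slice strips the leading $ before the float conversion:

import csv

cols = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')

# A made-up row in the same shape as salaries.csv:
line = 'Doe,CLERK,A001,Police Department,10/01/2012,$56705.00,$54321.00'

# Parse the line exactly as the mapper does (Python 2, hence .next()):
row = dict(zip(cols, [a.strip() for a in csv.reader([line]).next()]))

print float(row['AnnualSalary'][1:])  # 56705.0 -> yielded as ('salary', (56705.0, line))
print float(row['GrossPay'][1:])      # 54321.0 -> yielded as ('gross', (54321.0, line))
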
  • It can't find the file /tmp/hadoop-me/mapred/staging/me118248587/.staging/job_local118248587_0001/files/mrjob.zip#mrjob.zip; check your file copy. – AdamSkywalker Dec 24 '16 at 16:08
  • The xml files do not matter. I see paths starting with /tmp/hadoop-me, hdfs:///user/me, hdfs:///user/hduser; it's a bit messy. The job can't find mrjob.zip#mrjob.zip; check how you set the input files for hadoop – AdamSkywalker Dec 24 '16 at 17:35
  • Ha! Good catch. But what do I do to fix this? I can see it is messy now, but where do I set these directories so it's more tidy? – user189035 Dec 24 '16 at 19:10
  • Use the same user for hadoop, so all the user names are the same, e.g. hdfs dfs -mkdir /user/me/ instead of hdfs dfs -mkdir /user/hduser/, and then check the new error logs – AdamSkywalker Dec 24 '16 at 19:39
  • OK, I have replaced `hdfs dfs -mkdir /user/hduser/` with `hdfs dfs -mkdir /user/me/` but somehow I still get the same errors ;( – user189035 Dec 24 '16 at 19:54
    add your python script code – AdamSkywalker Dec 24 '16 at 19:57
  • In your logs I see Copying local files to hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/.. - this means your python library moves files to that folder on hdfs. In the command startup I see '-files', 'hdfs:///user/me/tmp/mrjob/top_salaries.me.20161224.195201.967990/files/mrjob.zip#mrjob.zip'. These are the files available to the job. When hadoop fails it says: File does not exist: /tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001/files/mrjob.zip#mrjob.zip – AdamSkywalker Dec 25 '16 at 10:26
  • The first 2 paths are the same, that's good. But hadoop is for some reason looking in the local folder /tmp/hadoop-me/mapred/.. instead of taking the hdfs input. Actually there's no more hadoop/mapred folder anywhere in the logs. There's some misconfiguration and I can't find it theoretically. There are several places to check, like app master logs and xml configs; it requires a bit of patience – AdamSkywalker Dec 25 '16 at 10:32
  • @AdamSkywalker: just before the error, I read `Cleaning up the staging area file:/tmp/hadoop-me/mapred/staging/me553683497/.staging/job_local553683497_0001`...isn't the cleaning up to blame for the file not being found at the next stage? – user189035 Dec 26 '16 at 22:30
  • No, the cleaning is the result of the first error – AdamSkywalker Dec 26 '16 at 22:36
  • Couldn't [Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized] be part of the issue? – Romain Jouin Dec 29 '16 at 19:45
  • @romainjouin: I really don't know anything about hadoop/mrjob. I'm just trying to get the example to run. FWIW, `java -version` returns `openjdk version "1.8.0_111" OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14) OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)` – user189035 Dec 29 '16 at 19:55
  • I have the same issue, trying to run the examples from the book "Hadoop with Python" – Alex Marandon Jan 06 '17 at 10:05
  • @AlexMarandon: thanks for your comment. I have posted this as an issue (since you confirmed it) to the book's [git](https://github.com/MinerKasch/HadoopWithPython/issues/1). Let's see if we get more info... – user189035 Jan 13 '17 at 18:14

1 Answer


Ok. You need to edit the file core-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>

and the file hdfs-site.xml as:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/edureka/hadoop-2.7.3/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/edureka/hadoop-2.7.3/datanode</value>
    </property>
</configuration>

and you need to create a mapred-site.xml file with this content (this appears to be the key fix: without mapreduce.framework.name set to yarn, the streaming job runs in the local job runner, which matches the processName=JobTracker messages and the file:/tmp/... staging path in the logs above):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

and you need to edit yarn-site.xml to contain:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

Then do:

start-dfs.sh
start-yarn.sh
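
If YARN came up cleanly, jps should now list ResourceManager and NodeManager in addition to the HDFS daemons:

jps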

then do:

hdfs dfs -mkdir /user/
hdfs dfs -mkdir /user/me/
hdfs dfs -mkdir /user/me/input/
hdfs dfs -put /home/me/Desktop/work/cv/hadoop/salaries.csv /user/me/input/

now doing:

sudo chmod a+x /home/me/Desktop/work/cv/hadoop/top_salaries.py
python2 top_salaries.py -r hadoop hdfs:///user/me/input/salaries.csv > answer.csv

works.
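
While the job runs, you can optionally watch it on YARN:

yarn application -list

Once it finishes, answer.csv has the same tab-separated JSON format as in the local run.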

  • Could you please explain the reason for this error and how it is fixed through these configs? Thank you – Jeena KK Feb 20 '22 at 17:59
  • Sorry, this was many years ago. I don't remember. And Hadoop may have changed so much in the meantime (I wouldn't know, I have not used it in years) that I'm not even sure the answer to your question would still apply today. – user189035 Feb 21 '22 at 18:56