How to run mrjob library python map reduce in ubuntu standalone local hadoop cluster

Question

I went through documentation and it says it is meant for aws, gcp. But they are also using it internally somehow right. So, there should be a way to make it run in our own locally created hadoop cluster in our own virtual box

some code for understanding how mrjob is used in code :-

class MovieSimilar(MRJob):
 def mapper_parse_input(self, key, line):
    (userID, movieID, rating, timestamp) = line.split('\t')
    yield  userID, (movieID, float(rating))
    ..........
    ..........
if __name__ == '__main__':
  MovieSimilar.run()

With hadoop streaming jar and normal python codes I am able to run python codes.But mrjob isn't accepting data-set location from command line and giving more than 2 values required to unpack. And that error is because it is unable to take date set given -input flag

The shell command I am using :-

bin/hadoop jar /usr/local/Cellar/hadoop/3.1.0/libexec/share/hadoop/tools/lib/hadoop-
streaming.jar \
-file /<path_to_mapper>/MovieSimilar.py \
-mapper /<path_to_mapper>/MovieSimilar.py \
-reducer /<path_to_reducer>/MovieSimilar.py  \
-input daily/<dataset-file>.csv \
-output daily/output

Note:- daily is my hdfs directory where datasets and result of programs get stored

Error message I am receiving :- more than 2 values required to unpack

With hadoop streaming jar and normal python codes I am able to run python codes.But mrjob isn't accepting data-set location from command line and giving more than 2 values required to unpack. And that error is because it is unable to take date set given -input flag — Ayush Singh, Nov 17 '20 at 05:58
Considering we cannot see your code or errors, it's hard to help with that — OneCricketeer, Nov 17 '20 at 14:57
Actually, I am expecting an hadoop and python map reduce expert to answer the question. Because my question is staight forward and has nothing to do with code. I am speaking about shell scripting here — Ayush Singh, Nov 18 '20 at 09:26
Once again, we don't know what command you have or the error message returned to you, but the answer to the original question is yes, and "2 values to unpack" is not related to shell scripting, it's a specific python error, so please show your code — OneCricketeer, Nov 18 '20 at 14:58
I have added the python code for reference . I just want to know how to run this in an ubuntu hadoop cluster locally setup — Ayush Singh, Nov 19 '20 at 14:28
And what command are you using to run the job? What does your input file look like? Based on the error, some line of your input file does not have 4 columns... Also, why not use Pyspark if you simply want to process a TSV file? — OneCricketeer, Nov 19 '20 at 15:01
By the way, your mapper function needs to be actually called `mapper` https://mrjob.readthedocs.io/en/latest/guides/quickstart.html#writing-your-first-job — OneCricketeer, Nov 19 '20 at 15:06
I have added the shell command which I am using to execute , and my code personally has 0 errors . It is executing totally fine when running in normal machine or a single node. The issue is with my shell command. I dont know anyother shell command to run this file — Ayush Singh, Nov 19 '20 at 15:14
It tells you the command right here https://mrjob.readthedocs.io/en/latest/guides/runners.html#running-on-your-own-hadoop-cluster — OneCricketeer, Nov 19 '20 at 15:22
If you want to use hdfs files, read this section https://mrjob.readthedocs.io/en/latest/guides/quickstart.html#running-your-job-different-ways — OneCricketeer, Nov 19 '20 at 15:24

OneCricketeer · Accepted Answer · 2020-11-19T15:25:58.433

says it is meant for aws, gcp

Those are examples. It is not meant for those. Notice the -r local and -r hadoop flags for running a job

https://mrjob.readthedocs.io/en/latest/guides/runners.html#running-on-your-own-hadoop-cluster

there should be a way to make it run in our own locally created hadoop cluster in our own virtual box

Setup your HADOOP_HOME, and HADOOP_CONF_DIR xml files to point at the cluster you want to run the code against, then using the -r hadoop runner flag, it'll find and run your code using the hadoop binary and hadoop-streaming jar file

more than 2 values required to unpack. And that error is because it is unable to take date set given -input flag

Can't see your input, but this line would cause that error if there were less than three tabs on any line (and you don't need parentheses left of the equals)

(userID, movieID, rating, timestamp) = line.split('\t')

I suggest testing your code using the local/inline runner first

The shell command I am using :-

bin/hadoop jar /usr/local/Cellar/hadoop/3.1.0/libexec/share/hadoop/tools/lib/hadoop- streaming.jar

Mrjob will build and submit that for you.

You only need to run python MovieSimilar.py with your input files

How to run mrjob library python map reduce in ubuntu standalone local hadoop cluster

1 Answers1