I went through the mrjob documentation, and it says mrjob is meant for AWS and GCP. But they must also be using it internally on their own clusters somehow, so there should be a way to make it run on a locally created Hadoop cluster inside our own VirtualBox VM.
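From my reading of the runner documentation, mrjob also ships a hadoop runner that builds and submits the streaming job by itself, so I would expect an invocation roughly like this (a sketch, not tested: the hdfs paths mirror my setup, and pointing HADOOP_HOME at the Homebrew install is my assumption):

export HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.0/libexec   # so mrjob can find bin/hadoop
python MovieSimilar.py -r hadoop \
    hdfs:///daily/<dataset-file>.csv \
    --output-dir hdfs:///daily/output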
Some code to show how mrjob is used in the script:
from mrjob.job import MRJob

class MovieSimilar(MRJob):

    def mapper_parse_input(self, key, line):
        # each input line should be four tab-separated fields
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield userID, (movieID, float(rating))

    ..........
    ..........

if __name__ == '__main__':
    MovieSimilar.run()
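For context: a named step method like mapper_parse_input only runs when it is registered via steps(). A minimal, self-contained sketch of that wiring (single step assumed; not my full script):

from mrjob.job import MRJob
from mrjob.step import MRStep

class MovieSimilar(MRJob):

    def steps(self):
        # named step methods must be registered here; otherwise mrjob
        # only looks for the default mapper/combiner/reducer names
        return [MRStep(mapper=self.mapper_parse_input)]

    def mapper_parse_input(self, key, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield userID, (movieID, float(rating))

if __name__ == '__main__':
    MovieSimilar.run()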
With the hadoop streaming jar and plain Python scripts I am able to run jobs fine. But mrjob isn't accepting the dataset location from the command line and fails with a "more than 2 values required to unpack" error. As far as I can tell, that error occurs because it is not picking up the dataset given by the -input flag.
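The unpack error itself is easy to reproduce: if a line reaching the mapper does not contain exactly four tab-separated fields, the tuple assignment fails. A minimal illustration (the sample line is hypothetical):

# unpacking four names from a line with fewer fields raises ValueError
line = "196\t242"   # hypothetical line with only two tab-separated fields
(userID, movieID, rating, timestamp) = line.split('\t')
# Python 3: ValueError: not enough values to unpack (expected 4, got 2)
# Python 2: ValueError: need more than 2 values to unpack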
The shell command I am using:
bin/hadoop jar /usr/local/Cellar/hadoop/3.1.0/libexec/share/hadoop/tools/lib/hadoop-streaming.jar \
    -file /<path_to_mapper>/MovieSimilar.py \
    -mapper /<path_to_mapper>/MovieSimilar.py \
    -reducer /<path_to_reducer>/MovieSimilar.py \
    -input daily/<dataset-file>.csv \
    -output daily/output
Note: daily is my HDFS directory, where datasets and program output are stored.
The error message I am receiving: more than 2 values required to unpack
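For reference, a more tolerant version of the mapper that skips malformed lines would at least show whether any input reaches the job at all (a hypothetical sketch; the class name is mine, and this works around the symptom rather than the -input issue):

from mrjob.job import MRJob

class MovieSimilarTolerant(MRJob):

    def mapper(self, key, line):
        fields = line.split('\t')
        if len(fields) != 4:
            return   # skip malformed or header lines instead of crashing
        userID, movieID, rating, timestamp = fields
        yield userID, (movieID, float(rating))

if __name__ == '__main__':
    MovieSimilarTolerant.run()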