I have the following program:
from mrjob.job import MRJob
from mrjob.step import MRStep


class RatingsBreakdown(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        # Each input line is: userID \t movieID \t rating \t timestamp
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer_count_ratings(self, key, values):
        # Count how many times each rating value occurs
        yield key, sum(values)


if __name__ == '__main__':
    RatingsBreakdown.run()
and I am trying to run it on Ubuntu 18.04 with:
sudo python3 RatingsBreakdown.py -r hadoop --hadoop-bin /usr/local/hadoop/bin/hadoop u.data
where u.data is the data source.
The program stops and I keep getting the following error:
OSError: Could not mkdir
hdfs:///user/root/tmp/mrjob/RatingsBreakdown.root.20191110.010957.606661/files/wd
When I try running the mkdir command manually, I get:
mkdir: Incomplete HDFS URI, no host: hdfs:///user/root/tmp/mrjob/RatingsBreakdown.root.20191110.010957.606661/files/w
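(For reference, the manual command was roughly the following; the timestamped job directory is the one from the error above and changes on every run:)

/usr/local/hadoop/bin/hadoop fs -mkdir -p hdfs:///user/root/tmp/mrjob/RatingsBreakdown.root.20191110.010957.606661/files/wd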
I should mention that I have a functional Hadoop installation (it works with Java-based programs) and that the Python environment is set up correctly as well. If I don't use the hadoop runner, the program executes correctly. It seems there is an interaction problem between Python (MRJob) and Hadoop.
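For comparison, running the job with mrjob's default inline runner, i.e. something along the lines of the command below, finishes without any errors:

python3 RatingsBreakdown.py u.data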
I searched and searched but can't seem to find anything helpful. Please help me! Thanks