I have Hadoop 2.7.1 installed on a three-machine cluster. I'm trying to run an inverted-index MapReduce job using MRJob and Hadoop Streaming.
Here's my configuration:
    MRJob.SORT_VALUES = True

    def steps(self):
        JOBCONF_STEP1 = {
            "mapred.map.tasks": 20,
            "mapred.reduce.tasks": 10
        }
        return [MRStep(jobconf=JOBCONF_STEP1,
                       mapper=self.mapper,
                       reducer=self.reducer)
                ]
However, I've noticed in my output that I often get the same key going to two different reducers. This results in output that looks like this:
Key | Output
Z | 2
X | 1,2
X | 3
Z | 1
This means that one reducer is getting the X key with the values 1 and 2, while another reducer is also getting the X key with the value 3. But I want just one reducer to get the X key and all of its associated values.
So the desired output is:
Key | Output
X | 1,2,3
Z | 1,2
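To make the expectation concrete, here is a minimal pure-Python sketch of the grouping I assume the shuffle should give me. It is not Hadoop's actual partitioner: `partition` and `shuffle` are hypothetical helpers, and `zlib.crc32` is just a deterministic stand-in for a hash that depends only on the key.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Stand-in for a hash partitioner: the result depends only on the key,
    # so identical keys always land on the same reducer index.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    # One dict per simulated reducer, mapping key -> list of values.
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers


# The (key, value) pairs from the example above.
pairs = [("Z", 2), ("X", 1), ("X", 2), ("X", 3), ("Z", 1)]

merged = {}
for reducer in shuffle(pairs, num_reducers=10):
    for key, values in reducer.items():
        # Each key should show up in exactly one simulated reducer.
        assert key not in merged
        merged[key] = sorted(values)

print(merged)
```

Under this model, a key can only be split across reducers if the partitioner sees something other than the key alone, or if the two keys are not actually identical.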
How do I troubleshoot this issue?
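As a first diagnostic step, something like the following sketch could list which keys appear more than once in the job output and in which part files. `find_repeated_keys` is a hypothetical helper, the sample lines are made up to mirror the table above, and it assumes each output line is `key<TAB>value`.

```python
from collections import defaultdict


def find_repeated_keys(lines_by_file):
    """Return {key: [file names]} for every key emitted more than once."""
    seen = defaultdict(list)
    for name, lines in lines_by_file.items():
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            # Everything before the first tab is treated as the key.
            key = line.split("\t", 1)[0]
            seen[key].append(name)
    return {key: files for key, files in seen.items() if len(files) > 1}


# Made-up sample data shaped like the problematic output above.
sample = {
    "part-00000": ['"Z"\t[2]', '"X"\t[1, 2]'],
    "part-00001": ['"X"\t[3]', '"Z"\t[1]'],
}

repeated = find_repeated_keys(sample)
for key, files in repeated.items():
    # repr() makes stray whitespace or other invisible differences visible.
    print(repr(key), files)
```

If the repeated keys always sit in different part files, that points at partitioning; if they appear twice in the same file, grouping within a single reducer would be the more likely suspect.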
Here is my full MRJob code:
%%writefile invertedIndex.py
import json
import mrjob
from mrjob.job import MRJob
from mrjob.step import MRStep


class MRinvertedIndex(MRJob):
    MRJob.SORT_VALUES = True

    def steps(self):
        JOBCONF_STEP1 = {
            "mapred.map.tasks": 20,
            "mapred.reduce.tasks": 10
        }
        return [MRStep(jobconf=JOBCONF_STEP1,
                       mapper=self.mapper,
                       reducer=self.reducer)
                ]

    def mapper(self, _, line):
        # Each input line is "<key>\t<JSON stripe>"
        key, stripe = line.split("\t")
        stripe = json.loads(stripe)
        # Invert the index: emit (word, original key) for every word in the stripe
        for w in stripe:
            yield w, key

    def reducer(self, key, values):
        # Collect every value seen for this key into a single list
        d = [v for v in values]
        yield key, d


if __name__ == '__main__':
    MRinvertedIndex.run()