I have Hadoop 2.7.1 installed on a three-machine cluster, and I'm trying to run an inverted-index MapReduce job using MRJob and Hadoop Streaming.

Here's my configuration:

MRJob.SORT_VALUES = True 

def steps(self):
    JOBCONF_STEP1 = {
        "mapred.map.tasks":20,
        "mapred.reduce.tasks":10
    }
    return [MRStep(jobconf=JOBCONF_STEP1,
                mapper=self.mapper,
                reducer=self.reducer)
            ]

However, I've noticed in my output that I often get the same key going to two different reducers. This results in output that looks like this:

Key | Output
Z   | 2
X   | 1,2
X   | 3
Z   | 1

This means that one reducer is getting the X key with the values 1 and 2, while another reducer is also getting the X key with the value 3. But I want a single reducer to get the X key and all of its associated values.

So the desired output is:

Key | Output
X   | 1,2,3
Z   | 1,2
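
One way to confirm the symptom on a larger output is to count how often each key appears once all the part files are combined. This is only a minimal sketch: it assumes each output line is a JSON key, a tab, then a JSON value, and the helper name `duplicated_keys` and the sample lines (modelled on the table above) are made up for illustration.

```python
from collections import Counter


def duplicated_keys(lines):
    """Return the keys that appear on more than one output line."""
    counts = Counter(
        line.split("\t", 1)[0]      # text before the first tab is the key
        for line in lines
        if line.strip()             # skip blank lines
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical sample lines mirroring the split output shown above
sample = ['"Z"\t["2"]', '"X"\t["1", "2"]', '"X"\t["3"]', '"Z"\t["1"]']
print(duplicated_keys(sample))
```

Any key this reports was emitted by more than one reduce call, which is the behaviour described here.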

How do I troubleshoot this issue?

Here is my MRJob code:

%%writefile invertedIndex.py

import json
import mrjob
from mrjob.job import MRJob
from mrjob.step import MRStep
class MRinvertedIndex(MRJob):

  MRJob.SORT_VALUES = True 

  def steps(self):
      JOBCONF_STEP1 = {
          "mapred.map.tasks":20,
          "mapred.reduce.tasks":10
      }
      return [MRStep(jobconf=JOBCONF_STEP1,
                  mapper=self.mapper,
                  reducer=self.reducer)
              ]

  def mapper(self,_,line):
      key, stripe = line.split("\t")
      stripe = json.loads(stripe)
      for w in stripe:
          yield w, key

  def reducer(self,key,values):
      d = [v for v in values]
      yield key,d

# Entry point: must sit at module level, outside the class body
if __name__ == '__main__':
    MRinvertedIndex.run()
Jack

1 Answer

Figured it out. The problem was that MRJob was setting the following by default:

'stream.num.map.output.key.fields': '2'
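
To see why that value could split a key across reducers, here is a rough simulation in plain Python. It is a sketch under assumptions: map output lines are tab-separated, the first N fields form the key, and the partitioner hashes the whole key. `streaming_key` and `partition` are illustrative names, and `zlib.crc32` is only a stand-in for Hadoop's real hash function.

```python
import zlib


def streaming_key(line, num_key_fields):
    """Join the first `num_key_fields` tab-separated fields into the key."""
    return "\t".join(line.split("\t")[:num_key_fields])


def partition(line, num_key_fields, num_reducers):
    """Pick a reducer index from a hash of the key (crc32 as a stand-in)."""
    key = streaming_key(line, num_key_fields)
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Map output for the word X appearing in documents 1, 2 and 3
lines = ['"X"\t"1"', '"X"\t"2"', '"X"\t"3"']

# With one key field every line has the key "X", so they share a partition
print({partition(line, 1, 10) for line in lines})

# With two key fields the value is part of the key, so the three lines have
# three different keys and may be hashed to different partitions
print({partition(line, 2, 10) for line in lines})
```

With one key field the first set always has exactly one element. With two key fields the value takes part in the hash, so nothing ties the three lines to the same reducer.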

I resolved the problem by explicitly setting in jobconf:

'stream.num.map.output.key.fields': '1'
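
Applied to the jobconf from the question, the override might look like the sketch below. Only the last entry is new; the other two are copied from the question, and the string value mirrors the setting quoted above.

```python
# Same jobconf as in the question, plus the explicit override
JOBCONF_STEP1 = {
    "mapred.map.tasks": 20,
    "mapred.reduce.tasks": 10,
    # Treat only the first tab-separated field of the map output as the key
    "stream.num.map.output.key.fields": "1",
}
```

The dict is then passed to MRStep(jobconf=JOBCONF_STEP1, ...) exactly as before. One likely side effect, stated as an assumption rather than something tested here: with a single key field the value no longer takes part in the sort, so SORT_VALUES may stop having any effect.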

I don't know how 2 came to be the default for this setting, but at least that solved my problem.

Jack