I have a program that iterates a mapper and a reducer n times consecutively. However, for each iteration, the mapper of each key-value pair computes a value that depends on n.

from mrjob.job import MRJob


class MRWord(MRJob):

    def mapper_init_def(self):
        # Runs before the mapper of the first step only.
        self.count = {}

    def mapper_count(self, key, value):
        self.count[key] = 0
        print(self.count[key])  # prints correctly
        yield key, value

    def mapper_iterate(self, key, value):
        yield key, value
        print(self.count[key])  # error: AttributeError, no attribute 'count'

    def reducer_iterate(self, key, value):
        yield key, value

    def steps(self):
        return [
            self.mr(mapper_init=self.mapper_init_def, mapper=self.mapper_count),
            self.mr(mapper=self.mapper_iterate, reducer=self.reducer_iterate),
        ]


if __name__ == '__main__':
    MRWord.run()

I defined a two-step MapReduce job in which the first step sets an attribute on the job, self.count. The program fails with AttributeError: 'MRWord' object has no attribute 'count'. It seems each step creates an independent MRJob object, so the attribute cannot be shared between steps. Is there another way to accomplish this?

Pippi
  • In my experience, these sorts of problems crop up from not properly converting your problem to the MR paradigm. Could you provide some more details on the algorithm you're implementing? My approach would be to emit the count *itself* and collect it in the reducer. Remember that you're working in a distributed computing environment - no guarantees where the data is. – pcoving Oct 03 '13 at 04:30
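A minimal sketch of the idea in the comment above: carry the count inside the emitted value rather than on self, so every step receives it with the data. This is plain Python that only simulates the steps, not mrjob itself; run_step, run_job and the (value, count) tuple layout are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def mapper_count(key, value):
    # First step: attach the per-key state to the value instead of self.
    yield key, (value, 0)


def mapper_iterate(key, value_and_count):
    # Later steps: the state arrives with the record, so no shared object is needed.
    value, count = value_and_count
    yield key, (value, count + 1)


def reducer_iterate(key, values):
    # Pass each (value, count) pair through unchanged.
    for value, count in values:
        yield key, (value, count)


def run_step(pairs, mapper, reducer=None):
    """Simulate one step: map every pair, group by key, then optionally reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    if reducer is None:
        return [(k, v) for k in sorted(grouped) for v in grouped[k]]
    return [pair for k in sorted(grouped) for pair in reducer(k, grouped[k])]


def run_job(pairs, iterations):
    """Run the counting step once, then the iterate step `iterations` times."""
    pairs = run_step(pairs, mapper_count)
    for _ in range(iterations):
        pairs = run_step(pairs, mapper_iterate, reducer_iterate)
    return pairs


result = run_job([("a", 1), ("b", 2)], iterations=3)
```

In a real job the same mapper and reducer bodies would become methods of the MRJob subclass, with the tuple serialized between steps by whatever protocol the job uses.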

1 Answer

Why don't you try defining your count in the class?

class MRWord(MRJob):
    count = {}

and drop the initializer:

def mapper_init_def(self):
    self.count = {}
kgu87