
I am running a MapReduce job on an 8-core machine using MRJob. I wrote it using the Python API, and I run it as

$ python main.py -r local files/input*

There are ~750 input files in that folder, and when I run it that way, I believe mrjob launches as many mapper processes as there are input files. Even though the machine only has 8 cores, I see a load average of

load average: 551.26, 553.29, 556.82

Is there a way to limit the number of mappers, so that only 8 (or 16) of them run at a time? I couldn't find a config option for this, which makes me suspect I'm doing something wrong somewhere.
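
For what it's worth, newer versions of mrjob appear to document a num_cores option for the local/inline runners that caps how many tasks run at once. I can't tell whether my version supports it, but if it does, I'd expect something like this in mrjob.conf to work (this is a guess on my part, not something I've verified):

runners:
  local:
    num_cores: 8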

Thanks!

EDIT

This is a rough outline of my MRJob task.

from mrjob.job import MRJob
from mrjob.step import MRStep

class MyMR(MRJob):

    def mapper_xml_init(self):
        self.abuf = ""

    def mapper_xml(self, _, line):
        # accumulate lines until a complete XML record sits in the buffer
        self.abuf += line.strip()

        # ... work with self.abuf

        if acondition:  # placeholder: true once a complete record was handled
            self.abuf = ""


    def reducer_mean(self, _, values):
        # process some stuff
        pass

    def steps(self):
        return [
            MRStep(mapper_init=self.mapper_xml_init, 
                   mapper=self.mapper_xml),  

            MRStep(reducer=self.reducer_mean),
        ]

if __name__ == '__main__':
    MyMR.run()
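
One workaround I'm considering, assuming the local runner really does start one mapper per input file: merge the ~750 small inputs into 8 larger chunk files first, so at most 8 mappers ever launch. A rough sketch (files/chunks is just a name I made up):

import glob
import os

NUM_CHUNKS = 8

files = sorted(glob.glob('files/input*'))
os.makedirs('files/chunks', exist_ok=True)

# round-robin the small files into NUM_CHUNKS larger ones; each source
# file is copied whole, so buffered XML records are never split across chunks
outs = [open('files/chunks/chunk%02d' % i, 'w') for i in range(NUM_CHUNKS)]
for i, path in enumerate(files):
    with open(path) as f:
        outs[i % NUM_CHUNKS].write(f.read())
for out in outs:
    out.close()

and then run the job on files/chunks/* instead.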
    Can you give me some more details on how you access Hadoop in your main.py? – Stefan Papp Feb 09 '16 at 11:11
  • @StefanPapp, thanks for asking for more info! I added the rough outline of the job. I'm running this for all files in a folder. Any ideas why I get 700 processes? – user1496984 Feb 10 '16 at 22:03

0 Answers