I am running a MapReduce job on an 8-core machine using mrjob. I wrote it with the Python API, and I run it as follows:
$ python main.py -r local files/input*
There are ~750 input files in that folder, and when I run the job this way, I believe mrjob launches as many mapper processes as there are input files. Even though the machine has only 8 cores, I see a load average of:
load average: 551.26, 553.29, 556.82
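For scale, here is a minimal sketch, separate from the job itself, of how that load compares to the core count. The helper name `load_per_core` is made up for illustration, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_per_core(load_average, cores):
    """Return the load average divided by the number of cores.

    Values well above 1.0 mean more runnable processes than cores.
    """
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return load_average / cores


if __name__ == "__main__":
    # The 1-minute figure quoted above, on the 8-core machine.
    print(f"reported: {load_per_core(551.26, 8):.1f} per core")

    # Live values for the current machine (Unix only).
    cores = os.cpu_count() or 1
    one_min, five_min, fifteen_min = os.getloadavg()
    print(f"current: {load_per_core(one_min, cores):.2f} per core")
```

With the numbers above this works out to roughly 69 runnable processes per core, which is consistent with hundreds of mappers running at once.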
Is there a way to limit the number of mappers so that only 8 (or 16) of them run at a time? I couldn't find a config option for this, which makes me suspect I'm doing something wrong somewhere.
Thanks!
EDIT
This is a rough outline of my mrjob task:
    from mrjob.job import MRJob
    from mrjob.step import MRStep


    class MyMR(MRJob):

        def mapper_xml_init(self):
            # Buffer that accumulates lines until a full record is seen
            self.abuf = ""

        def mapper_xml(self, _, line):
            self.abuf += line.strip()
            # ... work with self.abuf
            if acondition:  # placeholder condition from the outline
                self.abuf = ""

        def reducer_mean(self, _, values):
            # process some stuff
            pass

        def steps(self):
            return [
                MRStep(mapper_init=self.mapper_xml_init,
                       mapper=self.mapper_xml),
                MRStep(reducer=self.reducer_mean),
            ]


    if __name__ == '__main__':
        MyMR.run()