
I would like multiple mappers and reducers to run in parallel. According to the formula for the number of concurrent tasks,

min (yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb, 
 yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores)

I expected to get 4 tasks running in parallel, but in reality only one task runs.
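A quick sanity check of that formula, plugging in the values from the configs below (a sketch only; note that whether the vcore term matters at all depends on the scheduler's resource calculator, and the CapacityScheduler's default `DefaultResourceCalculator` considers memory only):

```python
# Per-node concurrency estimate: min(memory ratio, vcore ratio).
# Values taken from the yarn-site.xml / mapred-site.xml posted below.
nm_memory_mb = 8192   # yarn.nodemanager.resource.memory-mb
nm_vcores = 4         # yarn.nodemanager.resource.cpu-vcores
map_memory_mb = 2048  # mapreduce.map.memory.mb
map_vcores = 4        # mapreduce.map.cpu.vcores

concurrent_maps = min(nm_memory_mb // map_memory_mb, nm_vcores // map_vcores)
print(concurrent_maps)  # memory allows 4 containers, but vcores allow only 1
```

If the scheduler honors vcores, `mapreduce.map.cpu.vcores=4` on a 4-vcore node caps the job at one concurrent map task even though memory would allow four.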

I've also looked at a few other questions that might help with my situation; How to run MapReduce tasks in Parallel with hadoop 2.x? says the same thing.

My file is 628 MB and is in ORC format. The DFS block size is 256 MB, so I get 3 splits by default if I don't set mapreduce.input.fileinputformat.split.minsize as a parameter.
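The split count can be sanity-checked the same way (a sketch of FileInputFormat's usual arithmetic; ORC split generation differs in detail, so treat this as an approximation):

```python
import math

file_size_mb = 628
block_size_mb = 256           # dfs.blocksize
min_split_mb = 1              # mapreduce.input.fileinputformat.split.minsize left at default
max_split_mb = float("inf")   # mapreduce.input.fileinputformat.split.maxsize unset

# splitSize = max(minSize, min(maxSize, blockSize))
split_size_mb = max(min_split_mb, min(max_split_mb, block_size_mb))
num_splits = math.ceil(file_size_mb / split_size_mb)
print(num_splits)  # 628 / 256 -> 3 splits, hence 3 map tasks
```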

yarn-site.xml

<configuration>

<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<!-- YARN settings for lower and upper resource limits -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>4</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>4</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
    <property>
         <name>yarn.nodemanager.resource.cpu-vcores</name>
         <value>4</value>
    </property>

</configuration>

mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

<property>
  <name>mapreduce.client.submit.file.replication</name>
  <value>1</value>
</property>

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>

<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>

<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1638m</value>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>

<property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>4</value>
</property>

<property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>4</value>
</property>

</configuration>

I did check the free memory, and free reported about 9 GB available.

Could there be more configuration that is required?
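One knob that may be worth ruling out, given the formula above: if the scheduler is configured with DominantResourceCalculator, then mapreduce.[map|reduce].cpu.vcores = 4 on a 4-vcore NodeManager limits the job to one container at a time regardless of memory. A sketch of the change to mapred-site.xml, not a confirmed fix:

```
<property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
</property>
<property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
</property>
```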

user4157124
  • According to MR, the number of mappers is defined by the `InputFormat` based on the `InputSplit` (there is one task per split), so I'm not sure you can do this with MR configuration alone. I'd try a custom input format first. I think I have an example in my personal library where you control the number of mappers – Kenry Sanchez Aug 17 '20 at 15:43
  • The input split count is 3 in my case, with ```ORCInputFormat``` as the ```InputFormat```, yet I don't have 3 mappers running concurrently, just 1. – Vishwanth Iron Heart Aug 18 '20 at 05:24
  • But in the end you get the correct number of mappers? I mean, they just run one by one – Kenry Sanchez Aug 18 '20 at 06:39
  • From what I've seen, the 3 splits generated 3 tasks, each with its own unique ID, and only one task runs at any given moment. What I'd like is to make multiple tasks run simultaneously. I also checked the containers: it indicates there are 2 running containers, yet only 1 task executes at a time. – Vishwanth Iron Heart Aug 18 '20 at 08:18
  • In that case, I'd like to see the YARN logs. Remember that YARN will always try to run the mappers in parallel, within the limit of the resources it manages. Unless you have changed the scheduling policies, or you don't have enough memory to run all the mappers in parallel, I don't see why it would not work – Kenry Sanchez Aug 19 '20 at 07:06
  • I currently don't have access to the logs; I've requested it. But are you implying that, given these parameters, it should ideally have multiple tasks running in parallel? – Vishwanth Iron Heart Aug 23 '20 at 11:38
  • As far as I know, running multiple tasks in parallel is up to the scheduler. It depends on the scheduling policy, but it should happen by default. Remember that reducers cannot run until the mapper tasks are completed. – Kenry Sanchez Aug 23 '20 at 22:08
  • I guess you have the `scheduled handler service` activated. Right? – Kenry Sanchez Aug 24 '20 at 15:25
  • I'm not sure I follow. If it has a default value, I haven't modified it. – Vishwanth Iron Heart Aug 26 '20 at 07:55

0 Answers