I am running a Spark program on a Hadoop cluster that uses the YARN scheduler to run the tasks. However, I notice strange behavior: YARN sometimes kills a task, complaining of an out-of-memory error. Yet if I execute the tasks in rounds, that is, run the same number of tasks as there are containers/executors, let them finish, and then run the next group of tasks, everything works fine, which means the tasks aren't using any more memory than the containers allow. So I suspect that YARN is trying to run more than one task in parallel in the same container, which is why a container runs out of memory. Is there a way to restrict this behavior and tell YARN to run only one task at a time per container?
-
Can you add more details about the two approaches you followed, e.g. how many executors and how much executor memory? How are you running the different groups of tasks (is it memory-based)? Also, are any other applications that use YARN running on the cluster? – Ramzy Jun 23 '16 at 18:24
-
Basically, in the first approach I just use map. In the second approach, I run the program multiple times, each time with the number of tasks equal to the number of executors. When I do that, it works fine, but when I simply use map and run it in one go, it fails. – pythonic Jun 23 '16 at 18:27
1 Answer
In general, each YARN container that Spark requests corresponds directly to one "executor", and even though YARN may report 1 CPU allocated per container, under the hood Spark uses the `spark.executor.cores` setting to determine the number of concurrent tasks packed into a single executor/container process.

So simply set `spark.executor.cores=1` and each YARN container will only work on one task at a time. This can be done either as a spark-submit configuration like `--conf spark.executor.cores=1`, or you can put it in `conf/spark-defaults.conf` (on most standard Hadoop installations this would be inside `/etc/spark/conf/spark-defaults.conf`).
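For example, a minimal spark-submit invocation along those lines might look like the sketch below (the application class and jar names are placeholders):

```bash
# Pin each executor/container to a single concurrent task.
# com.example.MyApp / MyApp.jar are placeholder names.
spark-submit \
  --master yarn \
  --conf spark.executor.cores=1 \
  --class com.example.MyApp \
  MyApp.jar
```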
Note that there may still be multiple YARN containers per machine; if you want to further limit execution to 1 task at a time per machine, you'd also need to expand `spark.executor.memory` to be the amount of memory available on each machine (i.e., the amount allocated to the YARN NodeManagers running on that machine; YARN will refuse to pack any containers larger than what you've told the NodeManager it's allowed to use, even if physical memory is larger). Or you may find that you simply need to carve up your machines into slightly larger chunks, so you can just play with that memory setting to find the right memory size without sacrificing too much parallelism.
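The same settings can live in `spark-defaults.conf` instead; a rough sketch, assuming (purely for illustration) machines whose NodeManagers are given about 16 GB, with some headroom left for the executor memory overhead:

```
# conf/spark-defaults.conf sketch: one fat, single-task executor per machine.
# The 14g figure assumes ~16 GB NodeManagers and is illustrative only;
# leave room for spark.yarn.executor.memoryOverhead when sizing it.
spark.executor.cores    1
spark.executor.memory   14g
```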

-
Hi @Dennis Huo, why not set the executors to 1? The cores just sit on top of the executor, handling the parallelism. – Ramzy Jun 23 '16 at 20:32
-
Do you mean setting `spark.executor.instances=1`? In the question it says that YARN is killing containers for exceeding memory limits, which indicates that the problem lies in the *shaping* of the Spark execution environment inside each container, rather than in how the Spark application packs into YARN itself. For example, if `spark.executor.cores` is currently defaulted to 16 in the config, requesting `spark.executor.instances=1` only accomplishes limiting the application to run in a single YARN container, but it'll still try to do 16 things in parallel inside whatever memory was configured. – Dennis Huo Jun 23 '16 at 20:43
-
In this case, presumably pythonic still wants to take advantage of whatever parallelism is available, and just needs to make sure the shape of each container accommodates the task correctly. So say each machine has 16 cores and 16 GB of RAM, but each task needs to load 4GB into memory. If `spark.executor.instances=1`, `spark.executor.cores=16`, `spark.executor.memory=16G`, then Spark will grab 1 container and try to run 16 tasks at a time. Then it'll get kicked out by YARN because those 16 tasks will try to use a total of 16 * 4GB == 64GB in that container. – Dennis Huo Jun 23 '16 at 20:46
-
Alternatively, suppose we set `spark.executor.instances=1`, `spark.executor.cores=1`, `spark.executor.memory=16G`. Then Spark will grab the 1 container and try to run 1 task at a time, so it'll run fine, but only use 4 out of 16GB there. – Dennis Huo Jun 23 '16 at 20:46
-
In such a case it's more ideal to just set `spark.executor.instances=999`, `spark.executor.cores=1`, `spark.executor.memory=4G`. On that 16GB machine, YARN would pack 4 executors, each doing only 1 task at a time and getting to use the full 4GB per executor. And then furthermore, if the cluster actually has, say, 10 machines available, then `spark.executor.instances=999` lets you use all 10 machines, packing 4 executors per machine for 40 executors total, and each executor gets to use 4GB. – Dennis Huo Jun 23 '16 at 20:48
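As a rough sketch of the "many small, single-task executors" shape described in the comment above (the 4GB figure comes from the hypothetical example in this thread, and the class/jar names are placeholders):

```bash
# Sketch: one task per executor, 4g per executor, and an over-asked
# executor count so Spark fills whatever the cluster can actually grant.
# com.example.MyApp / MyApp.jar are placeholder names.
spark-submit \
  --master yarn \
  --conf spark.executor.instances=999 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=4g \
  --class com.example.MyApp \
  MyApp.jar
```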
-
Thanks for the extra info. I got the cores and memory part, but what is 999 useful for in this case? – Ramzy Jun 23 '16 at 21:07
-
I just got into the habit of specifying "a large number" whenever I want my Spark job to occupy as much of the cluster as it can; Spark doesn't care if you specify a larger number than can be immediately satisfied, so it's always safe to over-ask. Its YARN scheduler logic sometimes chokes if you pass a number greater than about 99999, and usually I have clusters with only a few hundred cores so "999" is large enough to mean "all the containers that can fit". – Dennis Huo Jun 23 '16 at 21:12
-
This also is handy if you have a big job and aren't sure how long it'll take. I work on [Google Cloud Dataproc](https://cloud.google.com/dataproc/) and also use it for some processing jobs, and we optimize spinup time so you can manually add extra VMs in the middle of a job; it takes less than a minute for the new workers to join. When I specify a large number of executors for a job (or just use Dataproc's default dynamic allocation), it means that as I add new machines, the job will immediately start using those extra VMs as well. – Dennis Huo Jun 23 '16 at 21:14
-
OK, got it. Dynamic allocation in Spark could also help, I suppose, to request as many executors as possible based on YARN resource allocation. – Ramzy Jun 23 '16 at 21:21
-
Yeah nowadays dynamic allocation is the better way to avoid having to worry about requesting the right number of executors and it's pretty responsive in modern Spark versions. For reference to anyone who hasn't used it before, it typically just means setting `spark.dynamicAllocation.enabled=true` and then avoiding setting `spark.executor.instances` at all (since explicitly setting that auto-disables dynamic allocation in current Spark versions). – Dennis Huo Jun 23 '16 at 21:26
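For anyone who hasn't set it up before, a minimal dynamic-allocation sketch on YARN might look like the following (property names as of Spark 1.x/2.x; on YARN it also needs the external shuffle service enabled on the NodeManagers, and the class/jar names are placeholders):

```bash
# Dynamic allocation sketch: let Spark grow and shrink the executor count.
# Do NOT set spark.executor.instances here, or dynamic allocation is disabled.
# com.example.MyApp / MyApp.jar are placeholder names.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.executor.cores=1 \
  --class com.example.MyApp \
  MyApp.jar
```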
-
spark.executor.cores is set to 1 anyway in my case, and I know about that executor memory thing as well, and have set it up correctly. Now, I changed the code to use mapPartitions, so that I could manually run each task one by one in a partition, but still the same problem! This problem is extremely mysterious :(! – pythonic Jun 23 '16 at 23:33
-
Hmm, is there any evidence that Spark is really trying to run more than 1 task at a time in a container then? It isn't supposed to with `spark.executor.cores=1`, and if so, it would be a bug that should be filed with the core Spark project. Did you check your Spark UI pages to see whether single executors are running multiple tasks at the same time on the "executors" tab? You should also double-check the "configuration" tab. If it's not actually a concurrent-execution problem, then that means the queueing up of further tasks is somehow causing the problem, which shouldn't normally affect worker nodes. – Dennis Huo Jun 23 '16 at 23:40
-
However, it can be worth a try in case the executors' overhead in queueing up more tasks is somehow contributing to the problem. In that case you should set `spark.yarn.executor.memoryOverhead` to a larger number if you haven't already; the default is 384MB, so try setting it to 1G. Also, are you sure YARN is killing tasks for running out of memory, rather than some issue in the AppMaster or your driver? – Dennis Huo Jun 23 '16 at 23:41
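As a sketch of bumping that overhead setting (the value is in megabytes; 1024 is just the figure suggested in the comment above, and in later Spark releases the property was renamed `spark.executor.memoryOverhead`):

```bash
# Raise the off-heap overhead YARN accounts for per executor (in MB).
# 1024 is just the example figure from the comment above.
spark-submit \
  --master yarn \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --class com.example.MyApp \
  MyApp.jar
```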
-
I have tried every trick under the sun, but it still doesn't work. I even tried using as many containers as there are tasks, but it still won't work. I use Scala processes inside a task. Could that be causing the problem? But in the code, I wait for those processes to finish, so the memory should be freed once they finish. – pythonic Jun 24 '16 at 15:16
-
Could you update your question with a sample code snippet showing how you run Scala processes in a task, and also the exact YARN error you get? Container-limit errors mentioning pmem are different from ones mentioning vmem, and both are different from actual OOM errors, so the more details you can give, the more helpful it will be. – Dennis Huo Jun 24 '16 at 15:20