
I have a task that would benefit from more cores but the standalone scheduler launches it when only a subset are available. I’d rather use all cluster cores on this task.

Is there a way to tell the scheduler to finish everything before allocating resources to a task? Put another way, the DAG would be better for this job if it ended all paths before executing the task, or waited until more cores were available. Perhaps there is a way to hint that a task is fat? I am not running YARN and do not wish to.

Succinctly: I need to run this map task on an otherwise idle cluster so it has all resources/cores. Is there any way to do this? Even a hacky answer would be appreciated.

Any ideas?

pferrel
  • "Task" in spark has a specific meaning that is very different from what you are using it for. "Application" would probably be a better word. – puhlen Jan 13 '17 at 16:08

2 Answers


Dynamic resource allocation might be what you are looking for. It scales the number of executors registered with this application up and down based on the workload.

You can enable it by passing config parameters to the SparkSession builder, for example:

val spark = SparkSession
  .builder()
  .appName("MyApp")
  // scale the number of executors up and down with the workload
  .config("spark.dynamicAllocation.enabled", "true")
  // required by dynamic allocation: the external shuffle service keeps
  // shuffle files available after an executor is removed
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()

See http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation for more details.
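The same two settings can equally be supplied at submit time. A sketch, where the master URL and jar name are placeholders to substitute with your own:

```shell
# Placeholders: replace the master URL and application jar with your own.
spark-submit \
  --master spark://master:7077 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  myapp.jar
```

Note that on a standalone cluster the external shuffle service must also be running on each worker; per the Spark docs, start your workers with `spark.shuffle.service.enabled` set to `true`.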

  • Maybe, I'll consider it, thanks. But actually I have 384 cores, all working away fine, then the scheduler launches this RDD.map when only 10 are available, so it takes forever, much longer than it takes to complete all the other tasks. After everything else in the super task finishes, it sits there spinning away with only 10 cores, the rest idle. – pferrel Jan 14 '17 at 00:05
  • How many partitions does the RDD have? The standalone scheduler (or YARN) only allocates resources; it does not control the execution flow. Execution is controlled by the `DAGScheduler`, and parallelism depends on your partitioning for the most part. – Alexey Svyatkovskiy Jan 14 '17 at 03:16

You would have to manually check YARN via the REST API to see when there are no applications running:

GET http://<rm http address:port>/ws/v1/cluster/metrics
{
  "clusterMetrics": {
    "appsSubmitted": 0,
    "appsCompleted": 0,
    "appsPending": 0,
    "appsRunning": 0,
    "appsFailed": 0,
    "appsKilled": 0,
    "reservedMB": 0,
    "availableMB": 17408,
    "allocatedMB": 0,
    "reservedVirtualCores": 0,
    "availableVirtualCores": 7,
    "allocatedVirtualCores": 1,
    "containersAllocated": 0,
    "containersReserved": 0,
    "containersPending": 0,
    "totalMB": 17408,
    "totalVirtualCores": 8,
    "totalNodes": 1,
    "lostNodes": 0,
    "unhealthyNodes": 0,
    "decommissionedNodes": 0,
    "rebootedNodes": 0,
    "activeNodes": 1
  }
}

When there are no pending or running apps, you could run your script. I would just create a shell script with a while loop + sleep that waits for both counts to reach 0.

You could also check available memory/cores. In fact I would go that route, so that you're not always waiting and you just guarantee enough resources.
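A minimal sketch of that polling loop, assuming the ResourceManager address and the final submit command are placeholders you fill in; it checks the `appsRunning` and `appsPending` fields of the metrics JSON shown above:

```shell
#!/bin/sh
# Placeholder ResourceManager address -- substitute your own host:port.
RM="${RM:-http://resourcemanager:8088}"

# Reads a cluster-metrics JSON document on stdin and exits 0 (idle)
# only when both appsRunning and appsPending are 0.
cluster_is_idle() {
  python3 -c '
import json, sys
m = json.load(sys.stdin)["clusterMetrics"]
sys.exit(0 if m["appsRunning"] == 0 and m["appsPending"] == 0 else 1)
'
}

# Poll the metrics endpoint every 30 seconds until the cluster is idle.
wait_for_idle() {
  until curl -s "$RM/ws/v1/cluster/metrics" | cluster_is_idle; do
    sleep 30
  done
}

# Usage (not run here):
#   wait_for_idle && spark-submit ... your-fat-job.jar
```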

Joe Widen
  • 1) I'd rather not use YARN, 2) this is a task inside a job. Not sure of the terminology, but it is the second layer into the details on the GUI. It is actually the closure of a single RDD.map operation, and it takes forever because in the current DAG it gets only 10 cores when there are over 500 in the dedicated cluster. – pferrel Jan 14 '17 at 00:01
  • I misunderstood the question. So you're saying that Spark is only using 10 of the 400 or so available cores? If that's the case, you need to make sure you have the same number of partitions as cores. You can do a repartition(num_cores) on the RDD before the map task and that will use them all, as long as you requested all the cores before starting the job. – Joe Widen Jan 14 '17 at 18:30