
I have a Spark/YARN cluster with 3 slaves set up on AWS.

I spark-submit a job like this:

~/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode cluster my.py

The final result should be a file containing the hostnames from all the slaves in the cluster. I was expecting to get a mix of hostnames in the output file; however, I only see one hostname. That means YARN never utilizes the other slaves in the cluster.

Am I missing something in the configuration?

I have also included my spark-env.sh settings below.

HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop/

SPARK_EXECUTOR_INSTANCES=3
SPARK_WORKER_CORES=3
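
As a sanity check (a hedged aside, not part of the original setup notes; it assumes the Hadoop CLI is on the PATH), yarn node -list shows whether all three NodeManagers have actually registered with the ResourceManager; a slave missing from this list can never receive executors:

yarn node -list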

my.py

import socket
import time
from pyspark import SparkContext, SparkConf

def get_ip_wrap(num):
    # Return the hostname of the executor that processes this element;
    # the element value itself is ignored.
    return socket.gethostname()

conf = SparkConf().setAppName('appName')
sc = SparkContext(conf=conf)

# Distribute 99 dummy elements across the cluster.
data = [x for x in range(1, 100)]
distData = sc.parallelize(data)

# Map each element to the hostname of the machine that handled it
# and write the results out as text.
result = distData.map(get_ip_wrap)
result.saveAsTextFile('hby%s' % str(time.time()))
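
For a quick check of executor spread (a hedged sketch, not part of the original script; it reuses the result RDD defined above), the distinct hostnames can be collected to the driver before writing the output:

hosts = result.distinct().collect()
print('tasks ran on %d distinct hosts: %s' % (len(hosts), hosts))

In cluster deploy mode the print output ends up in the driver's YARN container log rather than the local console.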

1 Answer


After I updated the following settings in spark-env.sh, all slaves are utilized.

SPARK_EXECUTOR_INSTANCES=3
SPARK_EXECUTOR_CORES=8
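
For reference, the same executor sizing can also be passed per job on the spark-submit command line instead of spark-env.sh (a sketch assuming the same Spark 2.1.1 install path and YARN cluster mode as above):

~/spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  --executor-cores 8 \
  my.py

--num-executors and --executor-cores correspond to spark.executor.instances and spark.executor.cores, respectively.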