
I have a Spark/YARN cluster with 3 slaves set up on AWS.

I spark-submit a job like this:

~/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode cluster my.py

The final result should be a file containing the hostnames from all the slaves in the cluster. I was expecting to get a mix of hostnames in the output file; however, I only see one hostname. That means YARN never utilizes the other slaves in the cluster.

Am I missing something in the configuration?

I have also included my spark-env.sh settings below.

HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop/

SPARK_EXECUTOR_INSTANCES=3
SPARK_WORKER_CORES=3
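
As a sanity check (a hedged aside, not part of the original setup notes; it assumes the Hadoop CLI is on the PATH), yarn node -list shows whether all three NodeManagers have actually registered with the ResourceManager; a slave missing from this list can never receive executors:

yarn node -list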

my.py

import socket
import time
from pyspark import SparkContext, SparkConf

def get_ip_wrap(num):
    # Return the hostname of the executor that processes this element;
    # the element value itself is ignored.
    return socket.gethostname()

conf = SparkConf().setAppName('appName')
sc = SparkContext(conf=conf)

# Distribute 99 dummy elements across the cluster.
data = [x for x in range(1, 100)]
distData = sc.parallelize(data)

# Map each element to the hostname of the machine that handled it
# and write the results out as text.
result = distData.map(get_ip_wrap)
result.saveAsTextFile('hby%s' % str(time.time()))
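
For a quick check of executor spread (a hedged sketch, not part of the original script; it reuses the result RDD defined above), the distinct hostnames can be collected to the driver before writing the output:

hosts = result.distinct().collect()
print('tasks ran on %d distinct hosts: %s' % (len(hosts), hosts))

In cluster deploy mode the print output ends up in the driver's YARN container log rather than the local console.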

1 Answer


After I updated the following settings in spark-env.sh, all slaves are utilized.

SPARK_EXECUTOR_INSTANCES=3
SPARK_EXECUTOR_CORES=8
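
For reference, the same executor sizing can also be passed per job on the spark-submit command line instead of spark-env.sh (a sketch assuming the same Spark 2.1.1 install path and YARN cluster mode as above):

~/spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  --executor-cores 8 \
  my.py

--num-executors and --executor-cores correspond to spark.executor.instances and spark.executor.cores, respectively.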