I have deployed a 3-node AWS ElasticMapReduce cluster bootstrapped with Apache Spark. From my local machine, I can access the master node by SSH:

ssh -i <key> hadoop@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com

Once SSH'd into the master node, I can start PySpark by running pyspark. Additionally (although this is insecure), I have configured the master node's security group to accept TCP traffic on port 7077 from my local machine's IP address specifically.

However, I am still unable to connect my local PySpark instance to my cluster:

MASTER=spark://ec2-master-node-public-address:7077 ./bin/pyspark

The above command results in a number of exceptions and leaves PySpark unable to initialize a SparkContext object.

Does anyone know how to successfully create a remote connection like the one I am describing above?
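
Before digging into the PySpark exceptions, it can help to confirm that anything accepts connections on port 7077 at all. The sketch below is not part of the original question: `port_is_reachable` is a hypothetical helper name, and it only checks that a TCP connection can be opened, not that a Spark master is on the other end. One assumption worth verifying on the cluster itself: EMR typically runs Spark on YARN rather than as a standalone master, in which case nothing may be listening on 7077 even with the security group opened.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical diagnostic helper: it says nothing about which service
    (if any) is answering on that port.
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling `port_is_reachable("ec2-master-node-public-address", 7077)` from the local machine (with the real public address substituted) separates a network or security-group problem from a Spark configuration problem.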

Soubhik

2 Answers

I have done something similar: I connected a Spark installation on an EC2 machine to the master node of a Hadoop cluster.

Make sure that access from the EC2 machine to the Hadoop master node is properly configured.

import os

from pyspark.sql import SparkSession

# Point Spark at the client-side Hadoop/YARN configuration before the session starts.
os.environ['HADOOP_CONF_DIR'] = '/etc/hadoop/hadoop/etc/hadoop'
os.environ['YARN_CONF_DIR'] = '/etc/hadoop/hadoop/etc/hadoop'

# Replace <master_ip> with the address of the Hadoop master node.
spark = (
    SparkSession.builder
    .appName("MySparkApp")
    .master("yarn")
    # fs.defaultFS expects a full URI, so the hdfs:// scheme is included here.
    .config("spark.hadoop.fs.defaultFS", "hdfs://<master_ip>:9000")
    .config("spark.hadoop.yarn.resourcemanager.address", "<master_ip>:8040")
    .config("spark.hadoop.yarn.resourcemanager.scheduler.address", "<master_ip>:8030")
    .getOrCreate()
)
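
With `.master("yarn")`, Spark reads the cluster's location from the client-side configuration files in the directory that `HADOOP_CONF_DIR` / `YARN_CONF_DIR` point to, so a wrong or empty directory is a common reason for the session to hang or fail. A minimal sketch of a pre-flight check follows; `missing_hadoop_conf_files` is a hypothetical helper, and the list of required file names is an assumption based on the standard Hadoop layout, not something the answer specifies.

```python
import os

# Assumed minimum set of client-side config files; adjust for your distribution.
REQUIRED_FILES = ("core-site.xml", "yarn-site.xml")


def missing_hadoop_conf_files(conf_dir, required=REQUIRED_FILES):
    """Return the names in `required` that are not regular files in conf_dir."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(conf_dir, name))
    ]
```

Running `missing_hadoop_conf_files('/etc/hadoop/hadoop/etc/hadoop')` before building the session returns an empty list when both files are present.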
visuman

Unless your local machine is the master node for your cluster, you cannot do that. You won't be able to do that with AWS EMR.

eliasah
  • Can you please explain why? I'd like to do this too, but need to be able to explain why I can't do it if this approach won't work. – GeorgeWilson Apr 08 '17 at 10:09
  • I honestly don't mind down-voting, but you could at least have the decency to comment why, considering the answer is valid... – eliasah Jun 05 '17 at 18:12
  • This is starting to be funny... You don't like the answer and you downvote? It's a valid answer! – eliasah Dec 05 '17 at 17:05
  • It's not an answer – a SO answer should at the very least contain a link to some source supporting your answer, better yet paraphrase that source in your answer (for quick access and in case the link goes down). – Markus Shepherd Aug 28 '19 at 07:54
  • @MarkusShepherd take it or leave it. This is a community wiki. There is no documentation to support it. You are welcome to try proving it wrong with some "sources" if you can find any, or improving the answer if it still stands. – eliasah Aug 28 '19 at 14:20
  • It actually is wrong as you can connect from your local machine to EMR via Livy. But that's not the point. The point is that your answer does not meet SO's standards (which I paraphrased above), hence people are downvoting it. – Markus Shepherd Aug 29 '19 at 08:58
  • I have done something similar where I wanted to connect the Spark installed on an EC2 machine to the master node of a Hadoop cluster. – visuman Jul 14 '23 at 17:18
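
The Livy route mentioned in the comments can be sketched roughly as follows. This is an illustration under assumptions, not something anyone in the thread posted: it assumes Apache Livy is installed on the cluster and that its REST endpoint is reachable from the local machine (Livy's default port is 8998; an SSH tunnel is one way to reach it). The helper names are hypothetical. The requests are only built here; `send` performs the actual HTTP call.

```python
import json
import urllib.request


def livy_post(base_url, path, payload):
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def new_session_request(base_url):
    # Ask Livy to start an interactive PySpark session.
    return livy_post(base_url, "/sessions", {"kind": "pyspark"})


def statement_request(base_url, session_id, code):
    # Submit a snippet of code to an existing session.
    return livy_post(base_url, f"/sessions/{session_id}/statements", {"code": code})


def send(request, timeout=10.0):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A typical flow would be `send(new_session_request(url))` to obtain a session id, waiting until the session reports it is idle, and then `send(statement_request(url, session_id, code))` for each snippet. The code runs on the cluster, so the local machine never needs to act as a Spark driver.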