
I have an AWS EMR cluster with Spark. I can connect to it (Spark):

  • from master node after SSHing into it
  • from another AWS EMR cluster

But I am NOT able to connect to it:

  • from my local machine (macOS Mojave)
  • from non-emr machines like Metabase and Redash

I have read the answers to this question. I have checked that folder permissions and disk space are fine on all the nodes. My assumption is that I'm facing a problem similar to the one James Wierzba describes in the comments there, but I do not have enough reputation to add a comment. Also, this might be a different problem, since it is specific to AWS EMR.

Connection works fine after SSHing to master node.

# SSHed to master node 
$ ssh -i ~/identityfile hadoop@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com

# on master node
$ /usr/lib/spark/bin/beeline -u 'jdbc:hive2://localhost:10001/default'
# it connects fine and I can run commands, e.g. 'show databases;'

# Beeline version 1.2.1-spark2-amzn-0 by Apache Hive

Connection to this node works fine from master node of another EMR cluster as well.

However, connection does not work from my local machine (macOS Mojave), Metabase and Redash.

My local machine:

# installed hive (for beeline)
$ brew install hive

# Beeline version 3.1.1 by Apache Hive
# connect directly
# I have checked that all ports are open for my IP

$ beeline -u 'jdbc:hive2://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:10001/default'
# ERROR: ConnectException: Operation timed out 
#
# this connection timeout probably has something to do with spark accepting only localhost connections 
# I have allowed all the ports in AWS security group for my IP

# connect via port forwarding

# open a port
$ ssh -i ~/identityfile -Nf -L 10001:localhost:10001 hadoop@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com


$ beeline -u 'jdbc:hive2://localhost:10001/default'
# Failed to connect to localhost:10001
# Required field 'client_protocol' is unset!

$ beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http'
# org.apache.http.ProtocolException: The server failed to respond with a valid HTTP response
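Before digging further into JDBC specifics, it can help to separate network reachability from protocol problems. A minimal sketch, assuming `nc` (netcat) with the `-z` port-scan option is available; the EC2 hostname is the same placeholder used above, so substitute your master node's public DNS:

```shell
# check_port: exit 0 if a TCP connection to host:port succeeds within 5 seconds
check_port() {
  nc -z -w 5 "$1" "$2"
}

# placeholder hostname from the question; substitute your master node's DNS
if check_port ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com 10001; then
  echo "10001 reachable: the timeout is a protocol problem, not networking"
else
  echo "10001 blocked: likely security group / network ACL / bind address"
fi
```

If the raw TCP connect also times out, the problem is below the Thrift/JDBC layer (security group, network ACL, or the server binding only to localhost); if it succeeds, the remaining errors are client/server protocol mismatches.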

I have set up Metabase and Redash on EC2.

Metabase → connect using data source Spark SQL → results in java.sql.SQLException: org.apache.spark.SparkException: java.io.IOException: Failed to create local dir in /mnt/tmp/blockmgr*

Redash → connect using data source Hive → results in the same error.
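The `Failed to create local dir in /mnt/tmp/blockmgr*` error points at Spark's scratch space (`spark.local.dir`, which EMR typically points at `/mnt/tmp`) rather than at networking. A rough check to run on every node; the path is EMR's usual default and should be adjusted if your configuration differs:

```shell
# check_scratch_dir: report whether a Spark scratch directory is writable
# and how much space its filesystem has left.
check_scratch_dir() {
  dir=$1
  if touch "$dir/.write_test" 2>/dev/null; then
    rm -f "$dir/.write_test"
    echo "$dir writable"
    df -h "$dir"
  else
    echo "$dir NOT writable (check ownership, permissions, free space)"
  fi
}

# /mnt/tmp is where EMR usually points spark.local.dir; adjust if needed
check_scratch_dir /mnt/tmp
```

Note that the check must run as the same user the Spark/Thrift server runs as, since ownership of the blockmgr directories is what usually trips this error.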

user954311
  • Have you checked your security group? The port should be open. A connection timeout on Amazon is typically a non-opened port. – sebge2 May 07 '19 at 12:20
  • Yes, I have all traffic open from my IP in the security group. I have updated the question with that info. – user954311 May 08 '19 at 13:15

1 Answer


You need to update the inbound rules of the security group attached to the master node of your EMR cluster. You will need to add the public IP address of your network. You can find your public IP address on the following website:

What is my IP

For more details on how to update the inbound rules with your IP address, refer to the following AWS documentation:

Authorizing Inbound Traffic for Your Linux Instances

You should also check the outbound rules of your own network in case you are working in a restricted network environment.

So make sure you have outbound access in your network and inbound access in your EMR master node's security group for all the ports you want to reach.
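As a sketch, the same inbound rule can be added with the AWS CLI. The security group ID and IP address below are placeholders (substitute the master node's security group and your own public IP), so the command is only printed here for review rather than executed:

```shell
MY_IP=203.0.113.5                # placeholder; your public IP (e.g. from checkip.amazonaws.com)
SG_ID=sg-0123456789abcdef0       # placeholder; the master node's security group ID

# print the command for review before running it for real
echo aws ec2 authorize-security-group-ingress \
  --group-id "$SG_ID" \
  --protocol tcp --port 10001 \
  --cidr "${MY_IP}/32"
```

Using a `/32` CIDR restricts the rule to a single address, which is safer than opening the port to 0.0.0.0/0.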

Harsh Bafna