
I'm having issues connecting from an AWS EMR cluster running Spark to another AWS EMR cluster running Presto.

The code, written in Python, is:

jdbcDF = spark.read \
        .format("jdbc") \
        .option("driver", "com.facebook.presto.jdbc.PrestoDriver") \
        .option("url", "jdbc:presto://ec2-xxxxxxxxxxxx.ap-southeast-2.compute.amazonaws.com:8889/hive/data-lake") \
        .option("user", "hadoop") \
        .option("dbtable", "customer") \
        .load()

It's deployed via aws emr add-steps with the option --packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0,org.postgresql:postgresql:42.2.9,com.facebook.presto:presto-jdbc:0.60\',\
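For context, the full step submission looks roughly like the sketch below. The cluster ID, step name, and S3 script path are placeholders I've made up; only the --packages list is taken from the actual deployment.

```shell
# Hypothetical full add-steps invocation; j-XXXXXXXXXXXXX and the S3 path
# are placeholders, the --packages list matches the one used above.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="presto-read",ActionOnFailure=CONTINUE,\
Args=[--deploy-mode,cluster,--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0,org.postgresql:postgresql:42.2.9,com.facebook.presto:presto-jdbc:0.60\',s3://my-bucket/jobs/read_customer.py]
```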

which, when deployed, throws the following error:

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:237)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:330)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:250)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    ... 4 more
Caused by: java.io.IOException: Failed to connect to ip-xxxx-xxx.ap-southeast-2.compute.internal/xxx-xxxx:41885
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ip-xxxxxxxxx.ap-southeast-2.compute.internal/xxxxxx:41885
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    ... 1 more
Caused by: java.net.ConnectException: Connection refused
    ... 11 more
End of LogType:stderr

Whilst I've redacted the IP address above (safety first), it is the same internal IP address as the Spark slave instance. It appears to be connecting to itself and failing.

I've opened up the ports in the AWS EC2 security groups, allowing access from both the Spark master and slaves to the Presto instance.
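One way to sanity-check that the security-group rules actually allow traffic from a Spark node to the Presto coordinator is a plain TCP probe from the node itself. This is just a sketch; the hostname is whatever your Presto endpoint is, and 8889 is the coordinator port used in the JDBC URL above.

```python
import socket


def can_connect(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False


# Run this on a Spark master/slave node, e.g.:
# can_connect("ec2-xxxxxxxxxxxx.ap-southeast-2.compute.amazonaws.com", 8889)
```

If this returns False from the Spark nodes but True from elsewhere, the problem is networking rather than the driver.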

If it helps, a quick Node.js script written to test connectivity works:

var presto = require('presto-client');

var client = new presto.Client({
  host: prestoEndpoint,
  user: 'hadoop',
  port: 8889,
});

client.execute({
  query: 'select * from customer',
  catalog: 'hive',
  schema: 'data-lake',
  source: 'nodejs-client',
  state: function(error, query_id, stats) {
     console.log({ message: 'status changed', id: query_id, stats: stats });
  },
  columns: function(error, data) {
     console.log({ resultColumns: data });
  },
  data: function(error, data, columns, stats) {
    console.log({data, columns});
  },
  success: function(error, stats) {
     console.log(error);
     console.log(JSON.stringify(stats, null,2));
  },
  error: function(error) {
    console.log(error);
  },
});

The key part of the error message seems to be:

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ip-xxxxxxxxx.ap-southeast-2.compute.internal/xxxxxx:41885

  • What is `com.facebook.presto:presto-jdbc:0.60`? The Presto version for EMR should be `0.2xx`, like 0.219, 0.225 and so on. 0.60 looks like a really old Presto version. – Lamanus Jan 27 '20 at 11:53
  • Interesting, I'll give that a go @Lamanus. Thanks. – Adam Jan 27 '20 at 18:41
  • @Lamanus back on deck and gave it a go; looks like you could be onto something with the version number. I've updated to 0.255 and I'm receiving a new error message `AttributeError: 'DataFrame' object has no attribute '_get_object_id'`. I'll keep digging and update with what I find. Thanks again. – Adam Jan 29 '20 at 15:45
  • @Lamanus the bug mentioned above was totally unrelated; the version upgrade did the trick. Works now!! – Adam Jan 29 '20 at 16:23

1 Answer


The problem was the version number of the presto-jdbc driver.

I updated it from com.facebook.presto:presto-jdbc:0.60 to com.facebook.presto:presto-jdbc:0.255, so the full packages parameter is

--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0,org.postgresql:postgresql:42.2.9,com.facebook.presto:presto-jdbc:0.255\',\

Thanks to @Lamanus for spotting that one.
