Note: This is NOT a duplicate of Can't read data in Presto - can in Hive
In an attempt to make my PySpark application (which uses boto3) work, I had to do the following multiple times:
- re-install pip
- re-install aws-sdk (boto3, botocore, aws-cli)
While I managed to make my application work, I ended up breaking the communication between Presto and S3, so that Presto can no longer read data from Hive EXTERNAL tables stored on S3 (while Hive can).
Upon running a simple query like SELECT COUNT(*) FROM my_db.my_table in Presto, the /var/log/presto/server.log file reports the following stacktrace:
2018-12-04T12:29:54.433+0530 WARN hive-hive-63 com.facebook.presto.hive.util.ResumableTasks ResumableTask completed exceptionally
java.lang.NoClassDefFoundError: Could not initialize class com.amazon.ws.emr.hadoop.fs.util.EmrFsUtils
at com.amazon.ws.emr.hadoop.fs.s3n.S3Credentials.initialize(S3Credentials.java:45)
at com.amazon.ws.emr.hadoop.fs.HadoopConfigurationAWSCredentialsProvider.<init>(HadoopConfigurationAWSCredentialsProvider.java:26)
at com.amazon.ws.emr.hadoop.fs.guice.DefaultAWSCredentialsProviderFactory.getAwsCredentialsProviderChain(DefaultAWSCredentialsProviderFactory.java:44)
at com.amazon.ws.emr.hadoop.fs.guice.DefaultAWSCredentialsProviderFactory.getAwsCredentialsProvider(DefaultAWSCredentialsProviderFactory.java:28)
at com.amazon.ws.emr.hadoop.fs.guice.EmrFSProdModule.getAwsCredentialsProvider(EmrFSProdModule.java:65)
...
see complete stacktrace here
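As far as I understand, "Could not initialize class" generally means the static initializer of that class already failed once earlier in the same JVM, and that first failure is normally reported as an ExceptionInInitializerError with the real cause attached. A minimal sketch for pulling such entries out of the log text follows; the function name and the context parameter are my own choices for illustration and are not part of Presto or EMR.

```python
from typing import List


def find_initializer_errors(log_text: str, context: int = 5) -> List[str]:
    """Return every line mentioning ExceptionInInitializerError,
    together with up to `context` lines that follow it."""
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if "ExceptionInInitializerError" in line:
            # Keep the matching line plus the start of its stack trace.
            hits.append("\n".join(lines[i:i + 1 + context]))
    return hits
```

It can be fed the contents of /var/log/presto/server.log (read with open(...).read()); an empty result would suggest the original failure was logged elsewhere or has been rotated out.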
I'd like to clarify that
- Only Presto seems to be affected; Hive, aws-cli, Spark etc. are able to read data as usual
- My EC2 instances have an attached IAM Role that permits reading data from all S3 buckets in my account (and writing to some specific buckets)
- Earlier, Presto had no complaints reading from S3; the problem arose only after fiddling with the environment
- Things run smoothly if I set the location of my Hive external table to HDFS
I've been through some related links to no avail
- Can't read data in Presto - can in Hive
- Considerations with Presto on Amazon EMR
- Authorizing Access to EMRFS Data in Amazon S3