I have a Spark cluster on DC/OS and I am running a Spark job that reads from S3. The versions are the following:

  • Spark 2.3.1
  • Hadoop 2.7
  • The dependency for the AWS connection: "org.apache.hadoop" % "hadoop-aws" % "3.0.0-alpha2"

I read in the data by doing the following:

val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", Config.awsEndpoint)
hadoopConf.set("fs.s3a.access.key", Config.awsAccessKey)
hadoopConf.set("fs.s3a.secret.key", Config.awsSecretKey)
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

val data = sparkSession.read.parquet("s3a://" + "path/to/file")

The error I am getting is:

Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:215)
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:138)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:170)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:44)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:321)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:543)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:809)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:182)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:207)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

This job only fails if I submit it as a JAR to the cluster. If I run the code locally or in a Docker container, it does not fail and reads the data perfectly.

I would be very grateful if anyone could help me with this!

– Giselle Van Dongen

3 Answers

This is one of the stack traces you get when you mix versions of the hadoop-* JARs.

As the S3A docs say:

Critical: Do not attempt to “drop in” a newer version of the AWS SDK than that which the Hadoop version was built with. Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see.

Randomly changing hadoop- and aws- JARs in the hope of making a problem “go away” or to gain access to a feature you want, will not lead to the outcome you desire.
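In build.sbt terms, that means letting hadoop-aws follow the Hadoop version your Spark distribution was built with, rather than picking a newer one. A minimal sketch, assuming the DC/OS Spark 2.3.1 package bundles Hadoop 2.7.x (the patch version is illustrative; match it to your cluster):

// hadoop-aws must come from the same release line as the Hadoop bundled
// with the cluster's Spark build (2.7.x here, illustrative).
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1" % "provided"
// hadoop-aws 2.7.x already pulls in the AWS SDK it was built against
// (aws-java-sdk 1.7.4), so do not add a newer SDK on top of it.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"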

– stevel

I was also facing a problem (not exactly the same exception) running a Docker image on a Spark cluster (Kubernetes), although it ran perfectly locally. I then changed the assembly merge strategy and the Hadoop version in build.sbt.

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" 
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.9"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.9"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.9"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.1"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.271"
dependencyOverrides += "org.apache.hadoop" % "hadoop-hdfs" % "3.1.1"
dependencyOverrides += "org.apache.hadoop" % "hadoop-client" % "3.1.1"

assemblyMergeStrategy in assembly := {
  // drop manifests and signature files that clash across JARs
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
  case "log4j.properties" => MergeStrategy.discard
  // merge service registrations line by line (this is how Hadoop discovers
  // filesystem implementations such as S3AFileSystem)
  case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
  case PathList("META-INF", "services", "org.apache.hadoop.fs.s3a.S3AFileSystem") => MergeStrategy.filterDistinctLines
  case "reference.conf" => MergeStrategy.concat
  case _ => MergeStrategy.first
}

But I am not sure whether this will work for you, because the same code does not work on an AWS EKS machine and throws the same exception when the Hadoop version is 2.8.1. The Hadoop and AWS versions are the same as the ones that work fine locally, so I am trying to reach the AWS team for help.
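If it helps to diagnose rather than guess, a small check (my sketch, not from the original answer) is to log which Hadoop build and which hadoop-aws JAR the submitted assembly actually resolves on the cluster; org.apache.hadoop.util.VersionInfo ships with hadoop-common:

// Diagnostic sketch: run this on the cluster where the failure occurs.
import org.apache.hadoop.util.VersionInfo

object HadoopClasspathCheck {
  def main(args: Array[String]): Unit = {
    // version of hadoop-common on the classpath
    println(s"Hadoop on classpath: ${VersionInfo.getVersion}")
    // which JAR S3AFileSystem was loaded from (the code source may be
    // null for classes on the boot classpath)
    val src = Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
      .getProtectionDomain.getCodeSource
    println(s"S3AFileSystem loaded from: ${Option(src).map(_.getLocation)}")
  }
}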

It seems the version of hadoop-aws you are using is not compatible with your version of Hadoop. Can you try hadoop-aws 2.7.3 together with aws-java-sdk 1.11.123? Hope this will solve your problem.

– Sachin Janani
  • you can't use hadoop-aws 2.7.x with any of the AWS SDK 1.11 releases; the AWS API changed too much. – stevel Oct 18 '18 at 15:45
  • @SteveLoughran I was suggesting the right version of the AWS SDK and asking the OP to use it. I have used it and it works fine. – Sachin Janani Oct 19 '18 at 07:16
  • And I was suggesting that hadoop-aws 2.7.x was built against AWS SDK 1.7.4; Hadoop 2.8 is at 1.10. As I recall, we only switched to 1.11 in Hadoop 2.9 (https://jira.apache.org/jira/browse/HADOOP-13050). If you have used it and it works, you haven't tried to run the hadoop-aws integration tests. – stevel Oct 20 '18 at 18:30
  • I did not test the aws-java-sdk version much, but your compatibility suggestion and the discussion here helped me find a version that solved the problem. Thanks a lot! – buxizhizhoum Nov 13 '18 at 01:33
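Putting the pairings from this discussion into build.sbt terms: a hedged sketch that pins every hadoop-* artifact to one release line so the assembly cannot mix them (versions follow stevel's comment above; adjust the patch release to your cluster):

// Pin all hadoop-* artifacts to the same line; hadoop-aws 2.7.x was built
// against aws-java-sdk 1.7.4, per the discussion above.
dependencyOverrides += "org.apache.hadoop" % "hadoop-common" % "2.7.3"
dependencyOverrides += "org.apache.hadoop" % "hadoop-client" % "2.7.3"
dependencyOverrides += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
dependencyOverrides += "com.amazonaws" % "aws-java-sdk" % "1.7.4"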