
I'm trying to load data from S3 using s3a (which, as far as I can tell, is the only option these days). I'm getting an error (java.lang.NoClassDefFoundError: org/apache/hadoop/fs/statistics/IOStatisticsSource) that I can find nothing about online. I've done everything I can think of to configure things for S3, but this error seems to be pretty rare.
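
For reference, the failing code is essentially the stock CSV reader pointed at an s3a path, something like this (the bucket and key here are placeholders):

from pyspark.sql import SparkSession

# Minimal sketch of the read that blows up; the bucket/key are
# placeholders, and AWS credentials are assumed to already be set up
# via the usual fs.s3a.* configuration.
spark = SparkSession.builder.appName("s3a-test").getOrCreate()
df = spark.read.csv("s3a://some-bucket/some/prefix/data.csv", header=True)
df.show()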

If someone can point me in the right direction, I'd appreciate it.

Here is the stack trace:

Traceback (most recent call last):
  File "/home/hdoop/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 737, in csv
  File "/home/hdoop/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/hdoop/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/home/hdoop/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o107.csv.
: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/statistics/IOStatisticsSource
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
    at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174)
    at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800)
    at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698)
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621)
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:576)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(Class.java:398)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2532)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2497)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:795)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.statistics.IOStatisticsSource
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
SeaTea
  • What Spark version are you using? – mazaneicha Oct 14 '21 at 19:20
  • This may have already been answered. [Check it out.](https://stackoverflow.com/questions/44411493/java-lang-noclassdeffounderror-org-apache-hadoop-fs-storagestatistics) – Matt Andruff Oct 14 '21 at 19:55
  • @mazaneicha Spark version 3.1.2 – SeaTea Oct 15 '21 at 15:54
  • @mazaneicha hadoop version 3.3.1 – SeaTea Oct 15 '21 at 15:56
  • And your hadoop-common is 3.3.1 as well? https://issues.apache.org/jira/browse/HADOOP-17450 – mazaneicha Oct 15 '21 at 16:07
  • @mazaneicha hadoop-common-3.3.1.jar hadoop-common-3.3.1-tests.jar hadoop-annotations-3.2.0.jar hadoop-auth-3.2.0.jar hadoop-aws-3.3.1.jar hadoop-client-3.2.0.jar hadoop-common-3.2.0.jar hadoop-hdfs-client-3.2.0.jar hadoop-mapreduce-client-common-3.2.0.jar hadoop-mapreduce-client-core-3.2.0.jar hadoop-mapreduce-client-jobclient-3.2.0.jar hadoop-yarn-api-3.2.0.jar hadoop-yarn-client-3.2.0.jar hadoop-yarn-common-3.2.0.jar hadoop-yarn-registry-3.2.0.jar hadoop-yarn-server-common-3.2.0.jar hadoop-yarn-server-web-proxy-3.2.0.jar parquet-hadoop-1.10.1.jar – SeaTea Oct 15 '21 at 17:32
  • Why do you have both 3.3.1 and 3.2.0? Please refer to @MattAndruff's comment. (A quick way to audit this is sketched after these comments.) – mazaneicha Oct 15 '21 at 17:38
  • Does this answer your question? [java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics](https://stackoverflow.com/questions/44411493/java-lang-noclassdeffounderror-org-apache-hadoop-fs-storagestatistics) – stevel Oct 15 '21 at 19:55
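
A quick way to spot that kind of mixed classpath is to list the Hadoop jars Spark actually loads along with their versions; a minimal sketch, assuming SPARK_HOME points at the Spark install:

import glob
import os
import re

# List every hadoop-*.jar on Spark's classpath with its version, so
# mixed versions (e.g. hadoop-aws 3.3.1 next to hadoop-common 3.2.0)
# stand out immediately. Assumes SPARK_HOME is set.
jars_dir = os.path.join(os.environ["SPARK_HOME"], "jars")
for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*.jar"))):
    name = os.path.basename(jar)
    match = re.match(r"(hadoop-[a-z-]+)-(\d+(?:\.\d+)*)", name)
    if match:
        print(f"{match.group(1):35s} {match.group(2)}")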

2 Answers


This seems to be a version mismatch between the Hadoop/AWS jars and Spark. You can use aws-java-sdk-bundle, which packages all the AWS jars you may need at a single, consistent version.

Here's the link: https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle

I'm using aws-java-sdk-bundle-1.11.874.jar with spark-3.1.2-bin-hadoop3.2 and it works perfectly.
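
As a sketch, one way to pull the bundle in is through spark.jars.packages when the session is created (the versions here are the pair that worked for me; hadoop-aws has to match the Hadoop version your Spark build ships with, 3.2.0 for spark-3.1.2-bin-hadoop3.2):

from pyspark.sql import SparkSession

# Sketch: resolve hadoop-aws plus a matching aws-java-sdk-bundle at
# session start. Pin hadoop-aws to your build's Hadoop version
# (3.2.0 for spark-3.1.2-bin-hadoop3.2); adjust as needed.
spark = (
    SparkSession.builder
    .appName("s3a-with-bundle")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.2.0,"
        "com.amazonaws:aws-java-sdk-bundle:1.11.874",
    )
    .getOrCreate()
)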

clairtonm
  • How do you invoke this such that the system ignores the mismatch? I tried passing this: --packages com.amazonaws:aws-java-sdk-bundle:1.12.89 to spark-submit, but that didn't do the trick. Same error as before. – SeaTea Oct 16 '21 at 18:08
  • The AWS jars are very sensitive (personal experience), so you need to find a version that matches your Spark version. I ran into the same error you're facing now and tried several different AWS jar versions; the one I mentioned is the one that worked for me, after several tries. Also, make sure the jars are actually being downloaded. – clairtonm Oct 18 '21 at 13:34
  • Is this as simple as matching the version number of the hadoop-aws library with the version of Spark (or Hadoop?), or is there some other way to determine what the right version is? – SeaTea Oct 20 '21 at 20:21
  • You can check the compile dependencies: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.3 Here's a link with some instructions on how to match the versions using the compile dependencies: https://notadatascientist.com/running-apache-spark-and-s3-locally/ (A quick way to check the Hadoop version from within PySpark is sketched below.) – clairtonm Oct 20 '21 at 23:49
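
To find the right version directly, you can also ask the running JVM which Hadoop version Spark was built against and pin hadoop-aws to exactly that; a minimal sketch, assuming an active SparkSession named spark:

# Print the Hadoop version Spark is actually running with; hadoop-aws
# (and friends) should be pinned to exactly this version.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())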

It turns out that passing the --packages parameter with the following versions worked for me:

--packages org.apache.hadoop:hadoop-aws:2.8.5,com.amazonaws:aws-java-sdk:1.11.659,org.apache.hadoop:hadoop-common:2.8.5
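
For reference, the full invocation would look something like this (the script name is a placeholder):

spark-submit --packages org.apache.hadoop:hadoop-aws:2.8.5,com.amazonaws:aws-java-sdk:1.11.659,org.apache.hadoop:hadoop-common:2.8.5 my_job.py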
SeaTea