
I don't understand how Spark handles or downloads the packages provided through the Scala (Ammonite) interface.

For my specific case: I want to explicitly pass AWS credentials to access some S3 buckets. The Spark cluster runs Spark 2.4.6 with Hadoop 2.9.2; the local environment runs Scala 2.11.12.

import $ivy.`com.amazonaws:aws-java-sdk:1.11.199`
import $ivy.`org.apache.hadoop:hadoop-common:2.9.2`
import $ivy.`org.apache.hadoop:hadoop-aws:2.9.2`
import $ivy.`org.apache.spark::spark-sql:2.4.6`

import org.apache.spark.sql._
import org.apache.spark._

val appName = "read-s3-test"
val accessKeyId = "xxxxxxxxxxxxxx"
val secretAccessKey = "xxxxxxxxxxxxxx"
val sessionToken = "xxxxxxxxxxxxxx"

val conf = new SparkConf()
    .setAppName(appName)
    .setMaster("spark://my-spark-master-svc:7077")
    .set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.jars.packages", "com.amazonaws:aws-java-sdk:1.11.199,org.apache.hadoop:hadoop-aws:2.9.2,org.apache.hadoop:hadoop-common:2.9.2")
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .set("spark.hadoop.fs.s3a.access.key", accessKeyId)
    .set("spark.hadoop.fs.s3a.secret.key", secretAccessKey)
    .set("spark.hadoop.fs.s3a.session.token", sessionToken)
val sc = new SparkContext(conf)
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()

This creates a session, but running any read command on an s3a path fails with java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics. From the logs emitted when creating the session, it looks like the config is probably not being applied:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/08/05 10:49:52 INFO SparkContext: Running Spark version 2.4.6
20/08/05 10:49:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/08/05 10:49:52 INFO SparkContext: Submitted application: read-s3-test
20/08/05 10:49:52 INFO SecurityManager: Changing view acls to: root
20/08/05 10:49:52 INFO SecurityManager: Changing modify acls to: root
20/08/05 10:49:52 INFO SecurityManager: Changing view acls groups to: 
20/08/05 10:49:52 INFO SecurityManager: Changing modify acls groups to: 
20/08/05 10:49:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/08/05 10:49:53 INFO Utils: Successfully started service 'sparkDriver' on port 46875.
20/08/05 10:49:53 INFO SparkEnv: Registering MapOutputTracker
20/08/05 10:49:53 INFO SparkEnv: Registering BlockManagerMaster
20/08/05 10:49:53 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/08/05 10:49:53 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/08/05 10:49:53 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b7fad649-d7d6-4b2e-b2c9-f54444e2fd22
20/08/05 10:49:53 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/08/05 10:49:53 INFO SparkEnv: Registering OutputCommitCoordinator
20/08/05 10:49:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/08/05 10:49:53 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://100.64.32.16:4040
20/08/05 10:49:53 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://my-spark-master-svc:7077...
20/08/05 10:49:53 INFO TransportClientFactory: Successfully created connection to my-spark-master-svc/172.20.99.118:7077 after 39 ms (0 ms spent in bootstraps)
20/08/05 10:49:53 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20200805104953-0007
20/08/05 10:49:53 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200805104953-0007/0 on worker-20200805082629-100.64.40.6-37063 (100.64.40.6:37063) with 2 core(s)
20/08/05 10:49:53 INFO StandaloneSchedulerBackend: Granted executor ID app-20200805104953-0007/0 on hostPort 100.64.40.6:37063 with 2 core(s), 1024.0 MB RAM
20/08/05 10:49:53 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200805104953-0007/1 on worker-20200805082549-100.64.8.0-42223 (100.64.8.0:42223) with 2 core(s)
20/08/05 10:49:53 INFO StandaloneSchedulerBackend: Granted executor ID app-20200805104953-0007/1 on hostPort 100.64.8.0:42223 with 2 core(s), 1024.0 MB RAM
20/08/05 10:49:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39367.
20/08/05 10:49:53 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200805104953-0007/0 is now RUNNING
20/08/05 10:49:53 INFO NettyBlockTransferService: Server created on 100.64.32.16:39367
20/08/05 10:49:53 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200805104953-0007/1 is now RUNNING
20/08/05 10:49:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/08/05 10:49:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 100.64.32.16, 39367, None)
20/08/05 10:49:53 INFO BlockManagerMasterEndpoint: Registering block manager 100.64.32.16:39367 with 366.3 MB RAM, BlockManagerId(driver, 100.64.32.16, 39367, None)
20/08/05 10:49:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 100.64.32.16, 39367, None)
20/08/05 10:49:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 100.64.32.16, 39367, None)
20/08/05 10:49:53 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/08/05 10:49:53 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
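
Given the "Using an existing SparkContext" warning at the end of the log, one way to verify whether the settings actually reached the running context is to read them back (a minimal sketch, assuming it is run in the same Ammonite session with the sc defined above):

// Read the effective settings back from the running context; empty or default
// results here would mean the SparkConf values above were never applied.
println(sc.getConf.getOption("spark.jars.packages"))
println(sc.hadoopConfiguration.get("fs.s3.impl"))
println(sc.hadoopConfiguration.get("fs.s3a.aws.credentials.provider"))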

Some remarks:

  • the exact PySpark equivalent works fine on the cluster (Python 3.7.6 with PySpark 2.4.4);
  • running on local Spark, instead of the cluster, also works fine;
  • to address the NativeCodeLoader warning, I already appended export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native to $SPARK_HOME/conf/spark-env.sh, but this resolved neither the warning nor the error above.
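
A related diagnostic (a sketch, driver side only, assuming the same Ammonite session): print which Hadoop build and which jar the driver actually picks up, to see whether the hadoop-common 2.9.2 pulled in via $ivy is really the one on the classpath.

// Hadoop version visible on the driver, and the jar that FileSystem is loaded from;
// ideally this points at the hadoop-common 2.9.2 artifact resolved via $ivy above.
import org.apache.hadoop.util.VersionInfo
println(VersionInfo.getVersion)
println(classOf[org.apache.hadoop.fs.FileSystem].getProtectionDomain.getCodeSource.getLocation)
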
  • Does this answer your question? [java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics](https://stackoverflow.com/questions/44411493/java-lang-noclassdeffounderror-org-apache-hadoop-fs-storagestatistics) – stevel Aug 06 '20 at 20:48
  • no, the error is identical, but I have matching hadoop versions. – Joost Döbken Aug 10 '20 at 12:18

1 Answer


This looks like a classpath issue: a different Hadoop version is being used at runtime. Can you double-check which Hadoop libraries you have as dependencies?

Actually, I have just found this link: [java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics](https://stackoverflow.com/questions/44411493/java-lang-noclassdeffounderror-org-apache-hadoop-fs-storagestatistics)
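
A quick way to test this directly (a sketch, run on the driver in the same session): org.apache.hadoop.fs.StorageStatistics only exists from hadoop-common 2.8 onwards, so resolving it by hand shows whether an older hadoop-common wins on the classpath.

// Try to resolve the class from the error; if it loads on the driver but the job
// still fails, the mismatch is most likely on the executor/cluster side instead.
try {
  val cls = Class.forName("org.apache.hadoop.fs.StorageStatistics")
  println(cls.getProtectionDomain.getCodeSource.getLocation)
} catch {
  case _: ClassNotFoundException =>
    println("StorageStatistics not on the driver classpath (hadoop-common < 2.8)")
}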

  • How can I check my dependencies? I am using the latest version of this spark chart: https://github.com/bitnami/charts/tree/c7751eb5764e468e1854b58a1b8491d2b13e0a4a/bitnami/spark – Joost Döbken Aug 06 '20 at 08:07
  • In the Spark UI you can see which jars are loaded; I believe it is somewhere under the Environment tab (it should say something like "Classpath Entries"). It seems like Spark is using a different version of Hadoop at runtime than your job expects. Other than that, try changing your code to use 'hadoop-aws:2.7.6' and 'hadoop-common:2.7.6' instead of 2.9.2 and see if that helps – ALincoln Aug 07 '20 at 09:58
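
Following up on the runtime-Hadoop point in the comment above, a hedged sketch for comparing the Hadoop version the driver sees with what the executors report (assumes the sc from the question; a mismatch here would fit the NoClassDefFoundError on s3a reads):

// Hadoop version on the driver vs. on the executors.
import org.apache.hadoop.util.VersionInfo
val driverHadoop = VersionInfo.getVersion
val executorHadoop = sc.parallelize(1 to 2, 2)
  .map(_ => org.apache.hadoop.util.VersionInfo.getVersion)
  .distinct()
  .collect()
println(s"driver:    $driverHadoop")
println(s"executors: ${executorHadoop.mkString(", ")}")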