We have a standalone Spring Boot based Spark application where, at the moment, the property spark.eventLog.dir is set to an S3 location:
SparkConf sparkConf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("MyApp")
        .set("spark.hadoop.fs.permissions.umask-mode", "000")
        .set("hive.warehouse.subdir.inherit.perms", "false")
        .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .set("spark.speculation", "false")
        .set("spark.eventLog.enabled", "true")
        .set("spark.extraListeners", "com.ClassName");
sparkConf.set("spark.eventLog.dir", "s3a://my-bucket-name/eventlog");
This has been working as expected. However, the bucket access has now changed to an S3 access point, so the URL has to use the ARN form arn:aws:s3:<bucket-region>:<accountNumber>:accesspoint:<access-point-name>, e.g.:
sparkConf.set("spark.eventLog.dir", "s3a://arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point/eventlog");
After this change we get the following stack trace while booting up the app:
java.lang.NullPointerException: null uri host.
at java.base/java.util.Objects.requireNonNull(Objects.java:246)
at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)
at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1866)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:71)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:522)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
Looking at the class S3xLoginHelper, it looks like the : characters in the URL prevent java.net.URI from parsing a host out of the authority, so the host comes back null and the requireNonNull check in buildFSURI throws.
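This can be reproduced outside Spark. A minimal standalone snippet (hypothetical, just for illustration): java.net.URI accepts the ARN-style string but falls back to a registry-based authority, leaving the host null, which is exactly what the requireNonNull in the stack trace trips over.
import java.net.URI;

public class UriHostCheck {
    public static void main(String[] args) throws Exception {
        // The authority contains ':' characters, so java.net.URI cannot
        // parse it as host[:port] and records no host component at all.
        URI uri = new URI("s3a://arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point/eventlog");
        System.out.println(uri.getHost());      // null -> triggers the NPE in S3xLoginHelper
        System.out.println(uri.getAuthority()); // arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point
    }
}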
I have the following relevant Maven dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>2.4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.2.0</version>
</dependency>
Update: I also tried adding the following to core-site.xml (and also tried hdfs-site.xml), as mentioned in the hadoop-aws documentation:
<property>
    <name>fs.s3a.bucket.my-access-point.accesspoint.arn</name>
    <value>arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point</value>
    <description>Configure S3a traffic to use this AccessPoint</description>
</property>
And updated the code with sparkConf.set("spark.eventLog.dir", "s3a://my-access-point/eventlog");
This gives a stack trace with java.io.FileNotFoundException: Bucket my-access-point does not exist, which indicates that it is not using those updated properties for spark.eventLog.dir and is treating my-access-point as a plain bucket name!
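For completeness, the same per-bucket property can presumably also be passed programmatically instead of via core-site.xml, since Spark copies any spark.hadoop.* entry into the Hadoop Configuration it hands to the filesystem. A sketch, assuming the property name from the documentation above:
// Hypothetical alternative to core-site.xml: set the per-bucket
// access point mapping directly on the SparkConf via the
// "spark.hadoop." prefix, which Spark forwards to the Hadoop Configuration.
sparkConf.set(
    "spark.hadoop.fs.s3a.bucket.my-access-point.accesspoint.arn",
    "arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point");
sparkConf.set("spark.eventLog.dir", "s3a://my-access-point/eventlog");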