When I try to read a Parquet file from a path like /test with spark.read.parquet(), I get an error saying file://test does not exist. When I add core-site.xml as a resource in code with
sc.hadoopConfiguration.addResource(new Path(<path-to-core-site.xml>))
it does look in HDFS. However, I don't want to add the resource in code. My question is: how do I make sure Spark reads core-site.xml and uses HDFS as the default file system?
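For context, this is roughly what the workaround looks like as a whole; the core-site.xml path and the app name are just placeholders for my actual setup:

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-test") // placeholder app name
      .getOrCreate()
    val sc = spark.sparkContext

    // Workaround I'd like to avoid: loading core-site.xml explicitly in code
    // (the path below is a placeholder for wherever core-site.xml lives)
    sc.hadoopConfiguration.addResource(new Path("/path/to/core-site.xml"))

    // With the resource added, /test is resolved against HDFS instead of file://
    val df = spark.read.parquet("/test")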
I've set up an Ubuntu 18.04.2 LTS server in a virtual machine with Hadoop 3, Spark 2.4.2, and YARN as the resource manager. I've configured core-site.xml with fs.defaultFS set to hdfs://localhost:9000, and I've set HADOOP_CONF_DIR in my bash file.
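For reference, the relevant part of my core-site.xml looks like this, and HADOOP_CONF_DIR points at the directory containing it (the exact path is just an example; it depends on where Hadoop is installed):

    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    # in my bash file (example path, depends on the Hadoop install location)
    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop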