In a Spark (3.2.0) application, I need to use different replication factors for different files written to HDFS. For instance, I write some temporary files that I want written with a replication factor of 1, and then some persistent files that should be written with a replication factor of 2, or sometimes 3.
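For concreteness, the intended pattern looks roughly like this (a sketch; tempDf, resultDf and the paths are placeholders):

// Sketch of the desired behaviour; tempDf, resultDf and the paths are placeholders.
// Temporary output: should be written with replication factor 1.
tempDf.write.parquet("hdfs:///tmp/myapp/intermediate")
// Persistent output: should be written with replication factor 2 (sometimes 3).
resultDf.write.parquet("hdfs:///data/myapp/result")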
However, as I tested, setting dfs.replication in SparkContext.hadoopConfiguration does not affect the replication factor of the written files at all, whereas spark.hadoop.dfs.replication sets it (i.e. overrides the default replication configured on the HDFS side) only when the SparkSession is created with a previously defined SparkConf, as below:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
conf.set("spark.hadoop.dfs.replication", "1") // works, but cannot be changed later
val sparkSession: SparkSession = SparkSession.builder.config(conf).getOrCreate()
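For reference, one way to check the replication factor the files actually end up with (the output directory below is a placeholder):

import org.apache.hadoop.fs.{FileSystem, Path}

// Print the replication factor of each file under the given output directory (placeholder path).
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("hdfs:///tmp/myapp/intermediate"))
  .foreach(s => println(s"${s.getPath.getName}: replication = ${s.getReplication}"))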
After some searching in the documentation, I came across the configuration spark.sql.legacy.setCommandRejectsSparkCoreConfs, added in Spark 3.0 and set to true by default; to change certain core confs at runtime it has to be explicitly set to false when the SparkSession is created. Even after doing that and thereby avoiding errors like org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark config, updating the replication factor through both configs in a function like the one below does not work either:
def setReplicationFactor(rf: Short): Unit = {
  val activeSparkSession = SparkSession.getActiveSession.get
  // Update both the session-level conf and the underlying Hadoop configuration.
  activeSparkSession.conf.set("spark.hadoop.dfs.replication", rf.toString)
  activeSparkSession.sparkContext.hadoopConfiguration.set("dfs.replication", rf.toString)
}
The files are still written with the replication factor from the initial configuration, even though both the Spark conf and SparkContext.hadoopConfiguration have been updated.
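For instance, calling the function between two writes (a minimal sketch; df and the output paths are placeholders) still produces files with the replication factor from the initial conf:

// Sketch: df and the output paths are placeholders for illustration.
setReplicationFactor(1)
df.write.parquet("hdfs:///tmp/myapp/intermediate") // expected replication 1

setReplicationFactor(3)
df.write.parquet("hdfs:///data/myapp/result") // still written with the initial replication, not 3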
Is there any way to write files to HDFS with different replication factors within the same Spark session?