
In a Spark (3.2.0) application, I need to change the replication factor for different files written to HDFS. For instance, I write some temporary files and want them written with a replication factor of 1. Then I write some files that are going to be persistent, and I want them written with a replication factor of 2, sometimes 3.

However, as I tested, setting dfs.replication in SparkContext.hadoopConfiguration does not affect the replication factor of the written files at all, while spark.hadoop.dfs.replication sets it (overriding the default replication configured on the HDFS side) only when the SparkSession is created with a previously defined SparkConf, as below.

val conf = new SparkConf()
conf.set("spark.hadoop.dfs.replication", "1") // works, but cannot be changed later
val sparkSession: SparkSession = SparkSession.builder.config(conf).getOrCreate()

Having searched the documentation, I came across the configuration spark.sql.legacy.setCommandRejectsSparkCoreConfs, which was added in Spark 3.0 and defaults to true; to change certain core configs at runtime, it must be explicitly set to false when the SparkSession is created. Even after doing that, which prevents errors like org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark config, updating both configs in a function like the one below

def setReplicationFactor(rf: Short): Unit = {
  val activeSparkSession = SparkSession.getActiveSession.get
  // Update both the Spark conf and the underlying Hadoop configuration
  activeSparkSession.conf.set("spark.hadoop.dfs.replication", rf.toString)
  activeSparkSession.sparkContext.hadoopConfiguration.set("dfs.replication", rf.toString)
}

does not change the replication factor of the files written afterwards, even though both the SparkConf and SparkContext.hadoopConfiguration are updated.

Is there any way to write files to HDFS with different replication factors within the same Spark session?

belce
1 Answer


This can totally be done on a per-file/folder basis, but you need to use Hadoop tools rather than Spark configuration.
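For instance, from inside the Spark application you can change the replication factor of files you have already written by calling the Hadoop FileSystem API directly. A minimal sketch; the helper name and the recursive directory handling are my own, not from the question:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Hypothetical helper: set the replication factor of an already-written
// HDFS file, or of every file under a directory, using the Hadoop
// FileSystem API instead of Spark configuration.
def setReplicationFor(spark: SparkSession, pathStr: String, rf: Short): Boolean = {
  require(rf >= 1, s"replication factor must be >= 1, got $rf")
  // Reuse Spark's Hadoop configuration so fs.defaultFS etc. are picked up
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val path = new Path(pathStr)
  if (fs.getFileStatus(path).isDirectory) {
    // setReplication applies to individual files, so walk the directory
    val files = fs.listFiles(path, /* recursive = */ true)
    var ok = true
    while (files.hasNext) ok &= fs.setReplication(files.next().getPath, rf)
    ok
  } else {
    fs.setReplication(path, rf)
  }
}
```

So one workaround for the same-session problem is to write temporary output with whatever the default replication is, then call a helper like this to lower it to 1, and raise it for the persistent output.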

REST call: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html

There are also command-line options, but I think WebHDFS is cleaner.
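A rough sketch of both options, assuming the WebHDFS SETREPLICATION operation and the `hdfs dfs -setrep` command; the namenode host, port and paths are placeholders for your cluster:

```shell
# WebHDFS REST call: set the replication factor of one file to 2
curl -X PUT \
  "http://namenode:9870/webhdfs/v1/warehouse/persistent/part-00000?op=SETREPLICATION&replication=2"

# HDFS CLI equivalent: -w waits until re-replication completes,
# and it applies recursively when given a directory
hdfs dfs -setrep -w 2 /warehouse/persistent
```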

Matt Andruff