
I use PySpark to write a Parquet file. I would like to change the HDFS block size of that file. I set the block size like this and it doesn't work:

sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

Does this have to be set before starting the PySpark job? If so, how do I do it?
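For example, is something like this, set through SparkConf before the context is created, the right way? (This is only a guess on my part, using the spark.hadoop.* prefix; the value is 128 MB in bytes.)

from pyspark import SparkConf, SparkContext

# Guess: spark.hadoop.* properties should be copied into the Hadoop configuration;
# the value here is 128 MB expressed in bytes.
conf = SparkConf().set("spark.hadoop.dfs.block.size", str(128 * 1024 * 1024))
sc = SparkContext(conf=conf)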

Sean Nguyen
  • Hi, if any of below answers has solved your problem please consider [accepting](http://meta.stackexchange.com/q/5234/179419) the best answer or adding your own solution. So, that it indicates to the wider community that you've found a solution. – mrsrinivas Oct 11 '17 at 09:36
  • I am not sure you can change it after the fact; this is how the file is written in HDFS. Spark allocates a task per file partition (a kind of mapper), which is why a lot of people recommend 256 MB blocks for Spark. – Thomas Decaux Oct 21 '17 at 07:47

3 Answers


Try setting it through sc._jsc.hadoopConfiguration() on the SparkContext before writing the output:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

txt = sc.parallelize(("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/output/path")  # saving output with a 128 MB block size

In Scala:

sc.hadoopConfiguration.set("dfs.block.size", "128m")
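Since the question is about Parquet, the same idea with a DataFrame write would look roughly like this (the DataFrame and output path are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").getOrCreate()

# Same setting as above, applied before a Parquet write; 128 MB expressed in bytes.
spark.sparkContext._jsc.hadoopConfiguration().set("dfs.block.size", str(128 * 1024 * 1024))

df = spark.createDataFrame([("Hello",), ("world",), ("!",)], ["word"])  # placeholder data
df.write.parquet("hdfs/output/path")  # placeholder path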
mrsrinivas

I had a similar issue, but I figured out the problem: the property needs a number of bytes, not "128m". So this should work (it worked for me, at least!):

block_size = str(1024 * 1024 * 128)  # 128 MB expressed in bytes
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
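To double-check that it took effect, something like this (a rough sketch; the path is a placeholder) reads back the block size HDFS reports for the written files, through the Java FileSystem API exposed by py4j:

# Ask HDFS for the block size of the output files ("hdfs/output/path" is a placeholder).
path = sc._jvm.org.apache.hadoop.fs.Path("hdfs/output/path")
fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
for status in fs.listStatus(path):
    print(status.getPath(), status.getBlockSize())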
genomics-geek

You can set the block size of the files that Spark writes as a write option:

myDataFrame.write.option("parquet.block.size", 256 * 1024 * 1024).parquet(destinationPath)
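If you want it for the whole session instead of per write, something like this should also work (a sketch, not tested here: the spark.hadoop.* prefix is copied into the Hadoop configuration that the Parquet writer reads; 256 MB expressed in bytes):

from pyspark.sql import SparkSession

# Session-wide alternative (sketch): 256 MB in bytes via the spark.hadoop.* prefix.
spark = (SparkSession.builder
         .config("spark.hadoop.parquet.block.size", str(256 * 1024 * 1024))
         .getOrCreate())

myDataFrame = spark.createDataFrame([(1, "a")], ["id", "value"])  # placeholder data
myDataFrame.write.parquet("hdfs/output/path")                     # placeholder path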
Thomas Decaux