
I use PySpark to write a Parquet file. I would like to change the HDFS block size of that file. I set the block size like this and it doesn't work:

sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

Does this have to be set before starting the PySpark job? If so, how do I do it?
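For example, is something like this, set through SparkConf before the context is created, the right way? (This is only a guess on my part, using the spark.hadoop.* prefix; the value is 128 MB in bytes.)

from pyspark import SparkConf, SparkContext

# Guess: spark.hadoop.* properties should be copied into the Hadoop configuration;
# the value here is 128 MB expressed in bytes.
conf = SparkConf().set("spark.hadoop.dfs.block.size", str(128 * 1024 * 1024))
sc = SparkContext(conf=conf)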

Sean Nguyen
  • Hi, if any of below answers has solved your problem please consider [accepting](http://meta.stackexchange.com/q/5234/179419) the best answer or adding your own solution. So, that it indicates to the wider community that you've found a solution. – mrsrinivas Oct 11 '17 at 09:36
  • I am not sure you can change it after the fact; this is how the file is written in HDFS. Spark allocates a task per file partition (a kind of mapper), which is why a lot of people recommend 256 MB blocks for Spark. – Thomas Decaux Oct 21 '17 at 07:47

3 Answers


Try setting it through sc._jsc.hadoopConfiguration() on the SparkContext before writing the output:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

txt = sc.parallelize(("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/output/path")  # saving output with a 128 MB block size

In Scala:

sc.hadoopConfiguration.set("dfs.block.size", "128m")
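Since the question is about Parquet, the same idea with a DataFrame write would look roughly like this (the DataFrame and output path are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").getOrCreate()

# Same setting as above, applied before a Parquet write; 128 MB expressed in bytes.
spark.sparkContext._jsc.hadoopConfiguration().set("dfs.block.size", str(128 * 1024 * 1024))

df = spark.createDataFrame([("Hello",), ("world",), ("!",)], ["word"])  # placeholder data
df.write.parquet("hdfs/output/path")  # placeholder path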
mrsrinivas

I had a similar issue, but I figured out the problem: the property needs a number of bytes, not "128m". So this should work (it worked for me, at least!):

block_size = str(1024 * 1024 * 128)  # 128 MB expressed in bytes
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
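To double-check that it took effect, something like this (a rough sketch; the path is a placeholder) reads back the block size HDFS reports for the written files, through the Java FileSystem API exposed by py4j:

# Ask HDFS for the block size of the output files ("hdfs/output/path" is a placeholder).
path = sc._jvm.org.apache.hadoop.fs.Path("hdfs/output/path")
fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
for status in fs.listStatus(path):
    print(status.getPath(), status.getBlockSize())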
genomics-geek

You can set the block size of the files that Spark writes as a write option:

myDataFrame.write.option("parquet.block.size", 256 * 1024 * 1024).parquet(destinationPath)
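If you want it for the whole session instead of per write, something like this should also work (a sketch, not tested here: the spark.hadoop.* prefix is copied into the Hadoop configuration that the Parquet writer reads; 256 MB expressed in bytes):

from pyspark.sql import SparkSession

# Session-wide alternative (sketch): 256 MB in bytes via the spark.hadoop.* prefix.
spark = (SparkSession.builder
         .config("spark.hadoop.parquet.block.size", str(256 * 1024 * 1024))
         .getOrCreate())

myDataFrame = spark.createDataFrame([(1, "a")], ["id", "value"])  # placeholder data
myDataFrame.write.parquet("hdfs/output/path")                     # placeholder path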
Thomas Decaux