19

So I have just one parquet file I'm reading with Spark (via Spark SQL) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100, and we have also tried changing the compression of the parquet from gzip to none. No matter what we do, the first stage of the Spark job has only a single partition (once a shuffle occurs it gets repartitioned into 100, and thereafter things are obviously much, much faster).

Now, according to a few sources (like the one below), parquet should be splittable (even when using gzip!), so I'm super confused and would love some advice.

https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/ch01.html

I'm using Spark 1.0.0, and apparently the default value of spark.sql.shuffle.partitions is 200, so it can't be that. In fact, all the defaults for parallelism are much greater than 1, so I don't understand what's going on.

samthebest
  • How about using RDD#repartition? – Soumya Simanta Nov 28 '14 at 19:29
  • @SoumyaSimanta That just forces a shuffle, the read will still be single threaded. – samthebest Nov 28 '14 at 19:38
  • Are you using HDFS? How many nodes are part of the persistence layer? – Soumya Simanta Nov 28 '14 at 19:58
  • @SoumyaSimanta I'm using HDFS; the number of nodes & CPUs is irrelevant, as one can easily create more partitions than one has threads to compute them. With `textFile`, for splittable compression codecs, it's easy ... try it: `sc.textFile(p, 100)` will result in 100 partitions no matter what your cluster configuration is. – samthebest Nov 29 '14 at 16:26

5 Answers

13

You should write your parquet files with a smaller block size. The default is 128 MB per block, but it's configurable by setting the parquet.block.size property in the writer.

The source of ParquetOutputFormat is here, if you want to dig into the details.

The block size is the minimum amount of logically readable data you can read out of a parquet file (since parquet is columnar, you can't just split by line or something equally trivial), so you can't have more reading threads than input blocks.
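
For copy-and-paste purposes, a minimal sketch of the write side (assuming the Spark 2.x DataFrame API; the 16 MB block size, the example data and the output path are all illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("write-small-blocks").getOrCreate()

    // parquet.block.size is the row-group size in bytes; 16 MB here is illustrative.
    // Smaller row groups give the reader more splits to work with.
    spark.sparkContext.hadoopConfiguration
      .set("parquet.block.size", (16 * 1024 * 1024).toString)

    val df = spark.range(0, 100000000L).toDF("id")           // example data
    df.write.mode("overwrite").parquet("/tmp/small-blocks")   // hypothetical path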

C4stor
  • Thanks, this answer seems the most accurate so far. Please could you A) confirm @Prokod's answer is impossible (it's saying one can split a parquet block), and B) write a line of example Spark code using `parquet.block.size` (and `Dataset`s) that writes the parquet, for people's copy-and-paste needs :) – samthebest Aug 10 '16 at 07:41
  • A) @Prokod is confusing the parquet block size and the HDFS block size. See e.g. https://groups.google.com/forum/#!topic/parquet-dev/t1iu0G-wLpE "There is still one limitation which is the smallest split you can get is of size of 1 row group." B) After 2 years, you may use any parquet version, but it should be close to sparkContext.hadoopConfiguration.set("parquet.block.size", newSize) and then use your context to write your dataset as usual. – C4stor Aug 10 '16 at 09:59
2

The new way of doing it (Spark 2.x) is setting

spark.sql.files.maxPartitionBytes

Source: https://issues.apache.org/jira/browse/SPARK-17998 (the official documentation is not correct yet; it misses the .sql)

In my experience, the Hadoop settings no longer have any effect.
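
A minimal sketch of using it, assuming Spark 2.x; the 16 MB cap and the input path are illustrative:

    import org.apache.spark.sql.SparkSession

    // Cap each read partition at 16 MB (value in bytes), so a single large
    // parquet file is split across many input partitions.
    val spark = SparkSession.builder()
      .appName("max-partition-bytes")
      .config("spark.sql.files.maxPartitionBytes", 16L * 1024 * 1024)
      .getOrCreate()

    val df = spark.read.parquet("/tmp/the-big-table.parquet")  // hypothetical path
    println(df.rdd.getNumPartitions)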

F Pereira
1

Maybe your parquet file only takes one HDFS block. Create a big parquet file that has many HDFS blocks and load it

    // SQLContext.parquetFile (Spark 1.x); yields one partition per HDFS block
    val k = sqlContext.parquetFile("the-big-table.parquet")
    k.partitions.length

You'll see the same number of partitions as HDFS blocks. This worked fine for me (Spark 1.1.0).

suztomo
  • Suppose one wants a significantly greater level of parallelism than what is implied by the HDFS block size: is there a way of ensuring that when the parquet is written, it can support splitting finer than the HDFS block size? – samthebest Dec 07 '14 at 10:36
  • No. I thought you didn't want to move HDFS blocks from their original location. If you're OK with shuffling, repartition the blocks. – suztomo Dec 07 '14 at 14:48
  • Yes, I want to avoid a shuffle; ideally I want 2 - 4 partitions per CPU without a shuffle. Suppose I have 1 GB of parquet and 1000 CPUs; clearly with normal block sizes this means most of my CPUs will go unused. My question is: *when writing the parquet*, is there a way to control the splittability? I guess most people don't care and just do a shuffle anyway, since shuffling data around the block size of HDFS should be fairly fast. (Oh BTW I will accept your answer once I've tested it, I don't have access to my cluster for a few days) – samthebest Dec 07 '14 at 15:56
1

You have mentioned that you want to control the distribution when writing to parquet. When you create parquet from an RDD, parquet preserves the partitioning of the RDD. So, if you create an RDD with 100 partitions and write it out as a dataframe in parquet format, it will write 100 separate parquet files to the filesystem. For reading, you could specify the spark.sql.shuffle.partitions parameter.
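
A minimal sketch of that write path, assuming the Spark 2.x DataFrame API; the example data and the output path are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hundred-parts").getOrCreate()

    // Repartition to 100 before writing; parquet preserves the partitioning,
    // so 100 separate part-files land on the filesystem.
    val df = spark.range(0, 10000000L).toDF("id").repartition(100)  // example data
    df.write.mode("overwrite").parquet("/tmp/hundred-parts")        // hypothetical path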

Kirby
0

To achieve that, you should use the SparkContext to set the Hadoop configuration (sc.hadoopConfiguration) property mapreduce.input.fileinputformat.split.maxsize.

By setting this property to a value lower than hdfs.blockSize, you will get as many partitions as there are splits.

For example:
When hdfs.blockSize = 134217728 (128MB),
and one file is read which contains exactly one full block,
and mapreduce.input.fileinputformat.split.maxsize = 67108864 (64MB)

Then the file will be read as two splits, giving two partitions.
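
A minimal sketch of this approach, assuming roughly the Spark 1.x API used elsewhere in this thread (newer versions use spark.sql.files.maxPartitionBytes instead, per the answer above); the path is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("split-maxsize"))
    val sqlContext = new SQLContext(sc)

    // 64 MB max split size (in bytes): half of the 128 MB HDFS block in the example,
    // so a file holding exactly one block is read as two splits / two partitions.
    sc.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)

    val k = sqlContext.parquetFile("/tmp/the-big-table.parquet")  // hypothetical path
    println(k.partitions.length)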

Prokod