19

So I have just one parquet file I'm reading with Spark (via Spark SQL) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100, and we have also tried changing the compression of the parquet from gzip to none. No matter what we do, the first stage of the Spark job has only a single partition (once a shuffle occurs it gets repartitioned into 100, and thereafter things are obviously much, much faster).

Now, according to a few sources (like the one below), parquet should be splittable (even when using gzip!), so I'm super confused and would love some advice.

https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/ch01.html

I'm using Spark 1.0.0, and apparently the default value of spark.sql.shuffle.partitions is 200, so it can't be that. In fact, all the defaults for parallelism are much greater than 1, so I don't understand what's going on.

samthebest
  • How about using RDD#repartition? – Soumya Simanta Nov 28 '14 at 19:29
  • @SoumyaSimanta That just forces a shuffle, the read will still be single threaded. – samthebest Nov 28 '14 at 19:38
  • Are you using HDFS? How many nodes are part of the persistence layer? – Soumya Simanta Nov 28 '14 at 19:58
  • @SoumyaSimanta I'm using HDFS; the number of nodes & CPUs is irrelevant, as one can easily create more partitions than one has threads to compute them. With `textFile`, for splittable compression codecs, it's easy ... try it: `sc.textFile(p, 100)` will result in 100 partitions no matter what your cluster configuration is. – samthebest Nov 29 '14 at 16:26

5 Answers

13

You should write your parquet files with a smaller block size. The default is 128 MB per block, but it's configurable by setting the parquet.block.size property in the writer.

The source of ParquetOutputFormat is here, if you want to dig into the details.

The block size is the minimum amount of logically readable data you can read out of a parquet file (since parquet is columnar, you can't just split by line or something equally trivial), so you can't have more reading threads than input blocks.
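
For copy-and-paste purposes, a minimal sketch of the write side (assuming the Spark 2.x DataFrame API; the 16 MB block size, the example data and the output path are all illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("write-small-blocks").getOrCreate()

    // parquet.block.size is the row-group size in bytes; 16 MB here is illustrative.
    // Smaller row groups give the reader more splits to work with.
    spark.sparkContext.hadoopConfiguration
      .set("parquet.block.size", (16 * 1024 * 1024).toString)

    val df = spark.range(0, 100000000L).toDF("id")           // example data
    df.write.mode("overwrite").parquet("/tmp/small-blocks")   // hypothetical path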

C4stor
  • Thanks, this answer seems the most accurate so far. Please could you A) confirm @Prokod's answer is impossible (it's saying one can split a parquet block), and B) write a line of example Spark code using `parquet.block.size` (and `Dataset`s) that writes the parquet, for people's copy-and-paste needs :) – samthebest Aug 10 '16 at 07:41
  • A) @Prokod is confusing the parquet block size and the HDFS block size. See e.g. https://groups.google.com/forum/#!topic/parquet-dev/t1iu0G-wLpE "There is still one limitation which is the smallest split you can get is of size of 1 row group." B) After 2 years, you may use any parquet version, but it should be close to sparkContext.hadoopConfiguration.set("parquet.block.size", newSize) and then use your context to write your dataset as usual. – C4stor Aug 10 '16 at 09:59
2

The new way of doing it (Spark 2.x) is setting

spark.sql.files.maxPartitionBytes

Source: https://issues.apache.org/jira/browse/SPARK-17998 (the official documentation is not correct yet; it misses the .sql)

In my experience, the Hadoop settings no longer have any effect.
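
A minimal sketch of using it, assuming Spark 2.x; the 16 MB cap and the input path are illustrative:

    import org.apache.spark.sql.SparkSession

    // Cap each read partition at 16 MB (value in bytes), so a single large
    // parquet file is split across many input partitions.
    val spark = SparkSession.builder()
      .appName("max-partition-bytes")
      .config("spark.sql.files.maxPartitionBytes", 16L * 1024 * 1024)
      .getOrCreate()

    val df = spark.read.parquet("/tmp/the-big-table.parquet")  // hypothetical path
    println(df.rdd.getNumPartitions)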

F Pereira
1

Maybe your parquet file only takes one HDFS block. Create a big parquet file that has many HDFS blocks and load it

    // SQLContext.parquetFile (Spark 1.x); yields one partition per HDFS block
    val k = sqlContext.parquetFile("the-big-table.parquet")
    k.partitions.length

You'll see the same number of partitions as HDFS blocks. This worked fine for me (Spark 1.1.0).

suztomo
  • Suppose one wants a significantly greater level of parallelism than what is implied by the HDFS block size: is there a way of ensuring that when the parquet is written, it can support splitting finer than the HDFS block size? – samthebest Dec 07 '14 at 10:36
  • No. I thought you didn't want to move HDFS blocks from their original location. If you're OK with shuffling, repartition the blocks. – suztomo Dec 07 '14 at 14:48
  • Yes, I want to avoid a shuffle; ideally I want 2 - 4 partitions per CPU without a shuffle. Suppose I have 1 GB of parquet and 1000 CPUs; clearly with normal block sizes this means most of my CPUs will go unused. My question is: *when writing the parquet*, is there a way to control the splittability? I guess most people don't care and just do a shuffle anyway, since shuffling data around the block size of HDFS should be fairly fast. (Oh BTW I will accept your answer once I've tested it, I don't have access to my cluster for a few days) – samthebest Dec 07 '14 at 15:56
1

You have mentioned that you want to control the distribution when writing to parquet. When you create parquet from an RDD, parquet preserves the partitioning of the RDD. So, if you create an RDD with 100 partitions and write it out as a dataframe in parquet format, it will write 100 separate parquet files to the filesystem. For reading, you could specify the spark.sql.shuffle.partitions parameter.
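
A minimal sketch of that write path, assuming the Spark 2.x DataFrame API; the example data and the output path are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hundred-parts").getOrCreate()

    // Repartition to 100 before writing; parquet preserves the partitioning,
    // so 100 separate part-files land on the filesystem.
    val df = spark.range(0, 10000000L).toDF("id").repartition(100)  // example data
    df.write.mode("overwrite").parquet("/tmp/hundred-parts")        // hypothetical path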

Kirby
0

To achieve that, you should use the SparkContext to set the Hadoop configuration (sc.hadoopConfiguration) property mapreduce.input.fileinputformat.split.maxsize.

By setting this property to a value lower than hdfs.blockSize, you will get as many partitions as there are splits.

For example:
When hdfs.blockSize = 134217728 (128MB),
and one file is read which contains exactly one full block,
and mapreduce.input.fileinputformat.split.maxsize = 67108864 (64MB)

Then the file will be read as two splits, giving two partitions.
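
A minimal sketch of this approach, assuming roughly the Spark 1.x API used elsewhere in this thread (newer versions use spark.sql.files.maxPartitionBytes instead, per the answer above); the path is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("split-maxsize"))
    val sqlContext = new SQLContext(sc)

    // 64 MB max split size (in bytes): half of the 128 MB HDFS block in the example,
    // so a file holding exactly one block is read as two splits / two partitions.
    sc.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)

    val k = sqlContext.parquetFile("/tmp/the-big-table.parquet")  // hypothetical path
    println(k.partitions.length)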

Prokod