
I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this:

df.coalesce(1)
    .write
    .partitionBy("entity", "year", "month", "day", "status")
    .mode(SaveMode.Append)
    .parquet(s"$location")

I've tested this and it doesn't seem to perform well. This is because there is only one partition to work on in the dataset and all the partitioning, compression and saving of files has to be done by one CPU core.

I could rewrite this to do the partitioning manually (using filter with the distinct partition values for example) before calling coalesce.
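
For reference, a rough sketch of what I mean by the manual approach (assuming the distinct partition-value tuples are few enough to collect to the driver; columns and $location are the same as above):

import org.apache.spark.sql.SaveMode
import spark.implicits._

// Collect the distinct partition-key tuples, then write one group at a time.
// Each filtered group is coalesced to a single partition, so it produces a
// single file, at the cost of one Spark job per distinct tuple.
val keys = df.select("entity", "year", "month", "day", "status").distinct().collect()

keys.foreach { key =>
  df.filter($"entity" === key(0) && $"year" === key(1) && $"month" === key(2) &&
      $"day" === key(3) && $"status" === key(4))
    .coalesce(1)
    .write
    .partitionBy("entity", "year", "month", "day", "status")
    .mode(SaveMode.Append)
    .parquet(s"$location")
}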

But is there a better way to do this using the standard Spark SQL API?

Patrick McGloin

3 Answers


I had the exact same problem and I found a way to do this using DataFrame.repartition(). The problem with using coalesce(1) is that your parallelism drops to 1, and it can be slow at best and error out at worst. Increasing that number doesn't help either -- if you do coalesce(10) you get more parallelism, but end up with 10 files per partition.
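
As a quick illustration of the parallelism point (just a sketch, using the DataFrame from the question):

// coalesce only caps the number of partitions; every task involved in the
// write (encoding, compression, writing the files) shares that cap.
df.coalesce(1).rdd.getNumPartitions   // 1  -> the whole write runs on a single core

// coalesce(10) restores some parallelism, but rows belonging to the same
// output partition are typically spread across the 10 partitions, so
// partitionBy ends up writing up to 10 files per output partition.
df.coalesce(10).rdd.getNumPartitions  // 10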

To get one file per partition without using coalesce(), use repartition() with the same columns you want the output to be partitioned by. So in your case, do this:

import spark.implicits._
df
  .repartition($"entity", $"year", $"month", $"day", $"status")
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .mode(SaveMode.Append)
  .parquet(s"$location")

Once I do that I get one parquet file per output partition, instead of multiple files.

I tested this in Python, but I assume in Scala it should be the same.

Ismail H
mortada
    I guess @PatrickMcGloin didn't report back, but this works beautifully and I will encourage Patrick to accept the answer. – Glennie Helles Sindholt Nov 23 '16 at 13:49
  • @GlennieHellesSindholt - you are right. Answer accepted. Thanks user3033652. – Patrick McGloin Nov 29 '16 at 09:12
  • just a note that in scala 2.0 one needs to give a new org.apache.spark.sql.Column("entity") etc. as argument to repartition – morpheus May 12 '17 at 02:24
  • @morpheus Or simply `$"entity"`. – Sanjay T. Sharma Jun 02 '17 at 11:56
  • Using Spark 1.6 this works perfectly for Parquet. However, with Avro I still end up with multiple files per partition. – fixje Aug 08 '17 at 13:32
  • @morpheus I'm getting `org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns` even though I'm using only 4 columns out of 30. Any ideas? – Hedrack Nov 20 '17 at 11:32
  • Is there any place this difference between `coalesce` and `repartition` is better explained? Specifically the implications for write parallelism. Comments in code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2862-L2881 – autodidacticon Mar 03 '18 at 02:00
  • Did someone found a way to implement this in pyspark as above example does not work. in Pyspark import pyspark.implicits._ or import spark.implicits._ does not exist and also the parameter SaveMode.Append is unknown – Alex Ortner Oct 17 '19 at 07:51
  • Keep in mind that you'll get `spark.sql.shuffle.partitions` partitions. You can always use `repartition(numPartitions: Int, partitionExprs: Column*)` if you know how many files you should expect on the file system – jjuraszek Jan 06 '20 at 12:48
  • How do you use $ notation in PySpark?? – Topde Apr 17 '20 at 11:08
  • you don't need any $ notation in pyspark, just use strings – mortada May 20 '20 at 05:31
  • example: * create dataframe => `df = spark.createDataFrame({('test', 1, .1), ('test', 2, .2), ('test2', 3, .3)}, 'a: string, b: int, c: float')` * partition and save => `df.repartition('a', 'b').write.partitionBy('a', 'b').mode('append').parquet('output')` – mortada May 20 '20 at 05:32
  • why will number of files not be equal to number of partitions? 200 by default? – dksahuji May 08 '21 at 03:41
  • should be .mode( saveMode="overwrite").parquet("path") – vaquar khan Apr 19 '23 at 16:23

By definition:

coalesce(numPartitions: Int): DataFrame
Returns a new DataFrame that has exactly numPartitions partitions.

You can use it to decrease the number of partitions in the RDD/DataFrame with the numPartitions parameter. It's useful for running operations more efficiently after filtering down a large dataset.
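
For example (a minimal sketch; the filter on status is just a placeholder):

import spark.implicits._

// After a selective filter most of the original partitions are nearly empty;
// coalesce merges them so the following stages run fewer, fuller tasks.
val filtered = df.filter($"status" === "ACTIVE")  // hypothetical filter
val compacted = filtered.coalesce(10)             // merge the sparse partitions into 10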

Concerning your code, it doesn't perform well because what you are actually doing is:

  1. putting everything into a single partition, which overloads the one task that has to pull in all of the data (and it is also not good practice);

  2. forcing a shuffle of all the data over the network, which may also result in performance loss.

The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

The shuffle concept is very important to manage and understand. It's always preferable to shuffle the minimum possible because it is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.

Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.

Concerning Parquet partitioning, I suggest that you read the answer here about Spark DataFrames with Parquet Partitioning, and also this section of the Spark Programming Guide on Performance Tuning.

I hope this helps!

eliasah
  • Hi, thanks for your reply. I'm fine with the coalesce having a cost. In my current code I partition the data manually, then call coalesce and save on each partition which runs fine. But instead of writing the partitionBy step myself I'd like to use the proper API. But in doing so, the coalesce has to come before the partitionBy. That's where I'm stuck. – Patrick McGloin Jan 15 '16 at 09:54
  • But what you are doing is putting everything in 1 partition and then partitionBy, you ought just partitionBy instead – eliasah Jan 15 '16 at 09:56
  • But then I end up with many files. I want to limit the number of Parquet files created. I'm streaming and saving once per minute, so every partition already has 1440 files. I don't want to multiply that. – Patrick McGloin Jan 15 '16 at 09:58
  • Ok let me put it this way, your code will write a parquet file per partition to file system (local or HDFS). This means that if you have 10 distinct entity and 3 distinct years for 12 months each, etc you might end up creating 1440 files. – eliasah Jan 15 '16 at 10:03
  • I think there is crossed wires. I am streaming the data in and calling save every minute. So as long as there is data for that partition, it will get a new file every minute. SaveMode.Append in Parquet creates a new file. – Patrick McGloin Jan 15 '16 at 10:05
  • @zero323 what do you think about this ? – eliasah Jan 18 '16 at 09:01
  • I still don't know how to efficiently save data into one Parquet file per partition using the standard API, so I don't think it answers the question. – Patrick McGloin Jan 18 '16 at 09:18
  • @PatrickMcGloin If I may ask, how many partition do you think that your DataFrame should have ? – eliasah Jan 18 '16 at 09:49
  • Ha, sorry. Per day around 10. – Patrick McGloin Jan 18 '16 at 09:55
  • and how many unique ("entity", "year", "month", "day", "status") tuple do you have ? I would say 1440... – eliasah Jan 18 '16 at 10:37
  • @eliasah, thank you ... I have a huge amount of data, accumulated quarterly each year. The data is a bit skewed, and when I try to get all of it into one dataframe by repartitioning on ("year", "quarter"), the shuffle spills a lot of data to disk, which makes my job slow; moreover only one executor is working 80% of the time. Hence I decided to 1) get the distinct groups of the dataframe, grouped by year and quarter, 2) iterate/loop over these distinct groups, fetch the data for each group's year, and save that dataframe/group as a Parquet file. – BdEngineer Jan 17 '19 at 15:20
  • @eliasah In Java we can use a for loop over the groups, but how do we do it in Spark with Scala? Your help is highly appreciated. – BdEngineer Jan 17 '19 at 15:20

It isn't much on top of @mortada's solution, but here's a little abstraction that ensures you use the same partitioning for repartition and write, and demonstrates sorting as well:

from datetime import datetime

def one_file_per_partition(df, path, partitions, sort_within_partitions=None):
    # Repartition by the same columns that are passed to partitionBy so each
    # output partition is written as a single file.
    start = datetime.now()
    df = df.repartition(*partitions)
    if sort_within_partitions:
        df = df.sortWithinPartitions(*sort_within_partitions)
    (df.write.partitionBy(*partitions)
        # TODO: Format of your choosing here
        .mode("append").parquet(path)
        # or, e.g.:
        # .option("compression", "gzip").option("header", "true").mode("overwrite").csv(path)
    )
    print(f"Wrote data partitioned by {partitions} and sorted by {sort_within_partitions} to:" +
        f"\n  {path}\n  Time taken: {(datetime.now() - start).total_seconds():,.2f} seconds")

Usage:

one_file_per_partition(df, location, ["entity", "year", "month", "day", "status"])
Alain