
I have a parquet partitioning issue that I am trying to solve. I have read a lot of material on partitioning on this site and on the web, but I still couldn't figure out my problem.

Step 1: I have a large dataset (~2 TB) that has MODULE and DATE columns and is partitioned by DATE, spanning 86 days. Each DATE partition has 21 files, for a total of 86 * 21 = 1806 files.
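
For concreteness, reading that layout looks roughly like the sketch below (the source path is a placeholder, not the real bucket):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path for the DATE-partitioned source described above.
df = spark.read.parquet("s3://path/source/")

# DATE comes back as a partition column because the data was written with
# partitionBy('DATE'); this should report 86 distinct values.
df.select("DATE").distinct().count()
```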

Step 2: I needed to aggregate the data based on the MODULE column, so I loaded it and saved it as another parquet, partitioned by MODULE. There are 9 modules and each module has data from all of the 86 days, so the resulting parquet had 9 * 1806 = 16254 files.
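
That write would look roughly like this (paths are placeholders; `spark` is the session from the sketch above). Because there is no repartition before the write, each task emits one file per MODULE value it contains, which lines up with the 9 * 1806 file count:

```python
# Step 2 sketch: re-save the dataset partitioned by MODULE (placeholder paths).
df = spark.read.parquet("s3://path/source/")
df.write.mode("overwrite").partitionBy("MODULE").parquet("s3://path/by_module/")
```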

Step 3: I loaded each MODULE partition through a for loop, performed my aggregations, and saved it back as a folder in append mode, so that I have the 9 modules as folders: s3://path/MODULE A/, s3://path/MODULE B/, etc. They are not partitioned by MODULE but just saved as folders. Since my default Spark number of partitions was 201, each module folder had 201 files, for a total of 9 * 201 = 1809 files.
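
Sketched out, the loop was along these lines (module values, output paths, and the aggregation itself are placeholders, since the real aggregation isn't shown here):

```python
from pyspark.sql import functions as F

modules = ["MODULE A", "MODULE B"]  # ... placeholder list for all 9 module values

for m in modules:
    part = spark.read.parquet("s3://path/by_module/").where(F.col("MODULE") == m)

    # Placeholder aggregation; the real one is omitted here.
    agg = part.groupBy("DATE").agg(F.count(F.lit(1)).alias("rows"))

    # Assuming the 201 comes from spark.sql.shuffle.partitions, the groupBy
    # shuffle leaves 201 partitions, hence 201 files per module folder.
    agg.write.mode("append").parquet(f"s3://path/{m}/")
```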

Step 4: So far so good, but I needed to partition it back by DATE. So I looped through each MODULE partition and saved it into a single unpartitioned parquet dataset. That resulted in a total of 2751 files. I don't know how that number comes about.
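
A sketch of that step (paths and module list are placeholders again). Since this job has no shuffle, the number of output files equals the number of read tasks, which Spark derives from the input file sizes relative to spark.sql.files.maxPartitionBytes (large files are split, small ones are packed together); that is a plausible reason the total lands at an unintuitive 2751 rather than 9 or 1809:

```python
modules = ["MODULE A", "MODULE B"]  # ... placeholder list for all 9 module values

for m in modules:
    # Append each module folder into one unpartitioned dataset (placeholder paths).
    (spark.read.parquet(f"s3://path/{m}/")
        .write.mode("append")
        .parquet("s3://path/combined/"))
```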

Step 5: I then loaded the entire unpartitioned dataset and saved it, partitioning by DATE. This resulted in about 39k files, each about 1.5 MB. So I have a huge number of small files, and it takes an immensely long time to load the parquet or do any kind of operation on it, such as groupBy.
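
That step would be roughly this (placeholder paths). Without repartitioning on DATE first, every task writes one file per DATE value it happens to contain, which is how the count balloons to ~39k tiny files:

```python
# Step 5 sketch: re-save the combined data partitioned by DATE (placeholder paths).
df = spark.read.parquet("s3://path/combined/")
df.write.mode("overwrite").partitionBy("DATE").parquet("s3://path/by_date/")
```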

After doing some more reading, I tried using repartition(1).partitionBy('DATE') in step 4 to reduce the number of files, but it failed towards the end. I know I am doing something wrong from step 4 onwards. Is there a more efficient way to do the whole thing?
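
For reference, the failed attempt was roughly this (placeholder paths); repartition(1) funnels the entire dataset through a single task that has to write every date on its own, which may be why the job dies near the end:

```python
# Attempted fix: one shuffle partition feeding the partitioned write.
df = spark.read.parquet("s3://path/combined/")
df.repartition(1).write.mode("overwrite").partitionBy("DATE").parquet("s3://path/by_date/")
```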

Thank You

thentangler
  • since you know you have 86 days of data, why can't you repartition it to 86 in step #4, maybe like `df.repartition(86, "DATE")` – Som Jul 14 '20 at 04:52
  • has nothing to do with pyspark – thebluephantom Jul 14 '20 at 08:25
  • @thebluephantom I added the pyspark tag since I was coding in pyspark, so that any potential solutions can be given in pyspark. @SomeshwarKale, I tried doing that and keep getting failed results after the job runs for more than 8 hrs or so. Any insight as to what I am doing wrong? – thentangler Jul 14 '20 at 13:12
  • that's a slight downfall, as it is common to the whole framework afaik – thebluephantom Jul 14 '20 at 13:58

1 Answer


Finding the correct number of partitions is your concern, then.

Suppose you have 86 days of data and you want to save it partitioned by date. Then you should decide how many files you want to create under each date partition.

Suppose you have 3 GB of data for each date; then you probably want at least 6 files in each date folder, so that each file is around 512 MB.

You can achieve this as

df.repartition(6, 'date').write.partitionBy('date')...

Now, if you want to restrict the number of records in each file, use the `maxRecordsPerFile` option:

df.repartition(6, 'date').write.option("maxRecordsPerFile", 10000).partitionBy('date')...
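
Putting it together, a self-contained version of that write could look like this (the input and output paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://path/combined/")  # placeholder input path

# Shuffling on 'date' before the partitioned write means each date's rows end
# up in a single writer task, so each date folder gets only a handful of files;
# maxRecordsPerFile then splits any file that would exceed 10,000 records.
(df.repartition(6, "date")
   .write
   .option("maxRecordsPerFile", 10000)
   .partitionBy("date")
   .mode("overwrite")
   .parquet("s3://path/by_date/"))
```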
Shubham Jain
  • Is it `repartition(col, numpartitions)` or `repartition(numpartitions, col)`? – thentangler Jul 14 '20 at 04:31
  • I tried that but it failed after running for around 8 hours. And the failure reason is : `Stage cancelled because SparkContext was shut down`. Is it because each file size is too large? – thentangler Jul 14 '20 at 13:10
  • Try increasing the number of files in each partition, as it will allow the data to be scattered across multiple workers – Shubham Jain Jul 14 '20 at 13:16