I have multiple DataFrames (DFs) storing monthly customer data for the last 5 years. Some DFs store Revenue information, others store Complaints data, and so on. All these DataFrames are keyed by Customer ID and Month, as you can see in the example dataset below:
df_revenue.show(100)
+-----------+--------+-------+-----+-------+
|Customer_ID|    City|Revenue| ....|  Month|
+-----------+--------+-------+-----+-------+
|      60012|  Berlin|  24.64| ....|JAN2017|
|      60012|  Berlin|  33.00| ....|FEB2017|
|      60012|  Berlin|  44.76| ....|MAR2017|
....
|      58937|Hannover| 101.32| ....|JAN2016|
|      58937|Hannover|  66.72| ....|APR2016|
|      58937|Hannover| 190.50| ....|AUG2018|
|      58937|Hannover|  89.71| ....|DEC2018|
....
+-----------+--------+-------+-----+-------+
The ultimate aim is to join these DataFrames over a specified 12-month period. So, for JUN2018, we would join all these DataFrames, taking only the data from Jul 2017 - Jun 2018 (the last 12 months).
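For example, the window bounds for a given month label could be computed with a small helper like this (just a sketch; window_bounds is a hypothetical name, and it assumes English month abbreviations and that Month is stored in the parquet files as a sortable 'yyyy-MM-dd' string, which is also what the filter further down assumes):
from datetime import datetime

def window_bounds(month_label):
    # Hypothetical helper: 'JUN2018' -> ('2017-07-01', '2018-06-01'),
    # i.e. the first day of the trailing 12-month window and the first
    # day of the window's last month.
    end = datetime.strptime(month_label, '%b%Y')
    start = datetime(end.year - 1 + end.month // 12, end.month % 12 + 1, 1)
    return start.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d')

window_bounds('JUN2018')   # ('2017-07-01', '2018-06-01')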
All these DataFrames, containing the complete 5-year data, have been stored in parquet format, partitioned by Month, as shown below:
df_revenue.write.format('parquet').partitionBy('Month').save('/../revenue')
Using partitionBy() on Month makes it faster to extract the data for the respective months from the parquet files, since the data is physically partitioned by Month. Once these subset DataFrames, each covering 12 months of data, are extracted, we can repartition them on Customer_ID and then join them, so as to avoid shuffles (a costly operation indeed) as much as possible. Something like this:
from pyspark.sql.functions import col, lit

# Assumes Month is stored as a sortable 'yyyy-MM-dd' string; labels like
# 'JAN2017' would not compare correctly as strings.
df_revenue = spark.read.format('parquet')\
    .load('/../revenue')\
    .where((col('Month') >= lit('2017-07-01'))
           & (col('Month') <= lit('2018-06-30')))\
    .repartition(NumPartitions, 'Customer_ID')

df_complaints = .....
    .repartition(NumPartitions, 'Customer_ID')
.
.
df_joined = df_revenue.join(df_complaints, ['Customer_ID'], how='left')
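As a quick sanity check that the Month predicate actually prunes partitions (rather than scanning all 5 years of files), the physical plan can be inspected; something like this (same placeholder path as above):
# The Month predicate should appear under 'PartitionFilters' in the
# scan node of the physical plan, confirming partition pruning:
spark.read.format('parquet').load('/../revenue')\
    .where(col('Month') >= lit('2017-07-01'))\
    .explain()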
My Question: Since we are ultimately joining these subset DataFrames on Customer_ID, will it be of any help to repartition() the data by Customer_ID before saving it as parquet files, as shown below?
df_revenue.repartition(n,'Customer_ID').write.format('parquet').partitionBy('Month').save('/../revenue')
The reason I am asking is that I have noticed that once we load the DataFrame back from the stored parquet files, the number of partitions changes; and if it changes, then what is the use of repartitioning the DataFrame before saving it in parquet format in the first place? I have seen so many posts on SO where people repartition() the DataFrame first and then store it in parquet.
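For illustration, this is roughly the check I ran (the target of 10 partitions and the path are just placeholders):
df_revenue.repartition(10, 'Customer_ID')\
    .write.format('parquet').partitionBy('Month').save('/../revenue')

df_check = spark.read.format('parquet').load('/../revenue')
# Typically NOT 10: on read, the partition count is derived from the
# file layout and spark.sql.files.maxPartitionBytes, not from the
# partitioning the DataFrame had when it was written.
print(df_check.rdd.getNumPartitions())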
Any clarification would be much appreciated.