
I have multiple DataFrames (DFs) storing monthly customer data for the last 5 years. Some DFs store Revenue information, others store Complaints data, and so on. All these DataFrames are keyed on Customer ID and Month, as you can see in the example dataset below -

df_revenue.show(100)
    +-----------+-----------+--------+-- ....--+-------+
    |Customer_ID|       City| Revenue|         |  Month|
    +-----------+-----------+--------+-- ....--+-------+
    |      60012|     Berlin|   24.64|         |JAN2017|
    |      60012|     Berlin|   33.00|         |FEB2017|
    |      60012|     Berlin|   44.76|         |MAR2017|
    ....
    ....
    ....
    |      58937|   Hannover|  101.32|         |JAN2016|
    |      58937|   Hannover|   66.72|         |APR2016|
    |      58937|   Hannover|  190.50|         |AUG2018|
    |      58937|   Hannover|   89.71|         |DEC2018|
    ....
    +-----------+-----------+--------+-- ....--+-------+

The ultimate aim is to join these DataFrames over a specified 12-month period. So, for JUN2018, we would join all these DataFrames by taking only the respective data from Jul 2017 - Jun 2018 (the last 12 months).
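For completeness, a minimal sketch of one way to derive such a 12-month window, assuming the Month values are 'MONYYYY' strings as in the sample above (the helper name is purely illustrative) -

from datetime import datetime
from pyspark.sql.functions import col

def last_12_month_labels(ref_month='JUN2018'):
    # Return the 12 month labels ending at ref_month, oldest first,
    # e.g. ['JUL2017', 'AUG2017', ..., 'JUN2018']
    end = datetime.strptime(ref_month, '%b%Y')   # '%b' matches 'JUN' case-insensitively
    labels = []
    y, m = end.year, end.month
    for _ in range(12):
        labels.append(datetime(y, m, 1).strftime('%b%Y').upper())
        m -= 1
        if m == 0:
            y, m = y - 1, 12
    return list(reversed(labels))

# e.g. keep only the JUN2018 window in the revenue DF
df_window = df_revenue.where(col('Month').isin(last_12_month_labels('JUN2018')))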

All these DataFrames, containing the complete 5-year data, have been stored in Parquet format, partitioned by Month, as shown below -

df_revenue.write.format('parquet').partitionBy('Month').save('/../revenue')
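This produces the usual Hive-style layout on disk, one sub-directory per Month value (file names below are placeholders) -

/../revenue/Month=JAN2016/part-....parquet
/../revenue/Month=FEB2016/part-....parquet
....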

Using partitionBy() on Month makes it faster to extract the data for the respective months from the Parquet files, since the data is physically partitioned by Month on disk. Once these subset DataFrames, covering the 12-month window, have been extracted, we can repartition each of them on Customer_ID and then join them, so as to avoid shuffles (a costly operation indeed) as much as possible. Something like this -

from pyspark.sql.functions import col, lit

df_revenue = spark.read.format('parquet')\
                       .load('/../revenue')\
                       .where((col('Month') >= lit('2017-07-01'))
                                      &
                              (col('Month') <= lit('2018-06-30')))\
                       .repartition(NumPartitions, 'Customer_ID')

df_complaints = .....
                       .repartition(NumPartitions, 'Customer_ID')
.
.
df_joined = df_revenue.join(df_complaints, ['Customer_ID'], how='left')
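As a quick check that the join itself is not adding yet another shuffle, one can inspect the physical plan and the partition counts of both sides (nothing data-specific here) -

# Show the full physical plan; if both inputs were already repartitioned on
# Customer_ID with the same number of partitions, ideally no additional
# Exchange (shuffle) should appear directly above the join
df_joined.explain(True)

# Current partition counts of both sides (should both be NumPartitions here)
print(df_revenue.rdd.getNumPartitions(), df_complaints.rdd.getNumPartitions())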

My Question: Since we are ultimately joining these subset DataFrames on Customer_ID, will it be of any help if we repartition() the data by Customer_ID before saving it as Parquet files, as shown below?

df_revenue.repartition(n,'Customer_ID').write.format('parquet').partitionBy('Month').save('/../revenue')

The reason I am asking this is that I have noticed that once we load the DataFrame back from the stored Parquet files, the number of partitions changes. And if it changes anyway, then what is the use of repartitioning the DataFrame before saving it as Parquet in the very first place? Yet I have seen many posts on SO where people repartition() the DataFrame first and then store it as Parquet.
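For what it's worth, a minimal way to observe this behaviour (the partition count 200 is just a placeholder) -

# Partition count right before the write, forced by repartition()
n_before = df_revenue.repartition(200, 'Customer_ID').rdd.getNumPartitions()   # 200

# Partition count after reading the parquet files back; this is driven by the
# number/size of the parquet files and spark.sql.files.maxPartitionBytes,
# not by the repartition() done before the write
n_after = spark.read.format('parquet').load('/../revenue').rdd.getNumPartitions()

print(n_before, n_after)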

Any clarification would be much appreciated.
