
I have a job where one dataset is joined with multiple other datasets. To avoid incurring the cost of reading from the source every time, I have created a GlobalTempView of the dataset. In HDFS the dataset is stored in 700 partitions, but in Spark the default number of shuffle partitions is 200. I want to cache the repartitioned view of the dataset, but I am not able to achieve it. Below is a code sample:

        Dataset<Row> accountToUserIdDataset = hdfsClient.loadDataSet("UserDataSetPath"); //<-700 partitions
        accountToUserIdDataset = accountToUserIdDataset.repartition(200); // repartition so that we don't have to repartition on every shuffle
        accountToUserDetailsDataset.createOrReplaceGlobalTempView("USERID_TO_DETAILS"); // so that we can join with user details in multiple Spark sessions created using newSession()
        accountToUserIdDataset.cache();
        accountToUserDetailsDataset.count(); //<- so that we force the materialisation of cache.

// across different Spark sessions, join with accountToUserIdDataset
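
For context, a minimal sketch of what the consuming side looks like (the second dataset, its path "OrdersPath", and the join column are placeholders here, not taken from the actual job; the classes are the usual org.apache.spark.sql ones used in the snippet above):

    SparkSession otherSession = spark.newSession();

    // Global temp views live in the global_temp database and are visible from any
    // session created with newSession() in the same application.
    Dataset<Row> userDetails = otherSession.table("global_temp.USERID_TO_DETAILS");

    // "OrdersPath" and "account_id" are placeholder names for the other side of the join.
    Dataset<Row> orders = otherSession.read().parquet("OrdersPath");
    Dataset<Row> joined = orders.join(userDetails,
            orders.col("account_id").equalTo(userDetails.col("account_id")));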

With the above code, when I try to join with accountToUserIdDataset:

  1. hdfsClient.loadDataSet("UserDataSetPath") is reused across multiple joins (confirmed by seeing the 700-task read stage marked as skipped), but the repartitioning still happens on every join.
  2. The Storage tab shows accountToUserDetailsDataset cached with 200 partitions.
  3. Below is a screenshot of one job: stage 7 is where the read from HDFS happens, and it is skipped. In stage 28 we repartition to 200 partitions and then cache. I was expecting stage 28 to be skipped as well.

Is this a gap in my understanding, or is it possible to cache a repartitioned dataset? I have gone through this and this answer as well. Another possible reason could be that in the joins we are only interested in a subset of the columns of accountToUserDetailsDataset, and because of this Spark has to reshuffle the dataset?

[screenshot: Spark UI DAG for the job, showing stage 7 (HDFS read, skipped) and stage 28 (repartition and cache)]
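
For reference, a small sketch of how the cached side can be inspected (using the names from the snippet above); this is only a check, not part of the job:

    // Number of partitions the dataset is evaluated with (should print 200 here).
    System.out.println(accountToUserIdDataset.rdd().getNumPartitions());

    // The physical plan shows the InMemoryRelation and whether a join on top of it
    // still inserts an Exchange (i.e. another shuffle).
    accountToUserIdDataset.explain(true);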

best wishes
  • My problem is that stages like stage 28 happen on each join and take a lot of time. My aim is to cache the data in such a way that I don't need this reprocessing. – best wishes Jun 23 '23 at 13:53
  • You are doing `accountToUserIdDataset.cache()` and then `.count()` on a different dataframe? – mazaneicha Jun 23 '23 at 13:59
  • Besides, the exchange between stage 7 -> stage 28 seems to be where the repartition happens. Then the new df is cached (green dot; on the Storage tab in the Spark UI you should see into how many partitions the cached df was split). There is some follow-up transform (WSCG) that happens within the same stage, but it is not cached. – mazaneicha Jun 23 '23 at 14:06
  • Upon looking at the SQL view, I saw that one filter stage is being executed after caching; that is what appears in stage 28. – best wishes Jun 23 '23 at 16:07
  • @mazaneicha check out my answer. – best wishes Jun 23 '23 at 16:58
  • Very good, congrats! – mazaneicha Jun 23 '23 at 17:20

1 Answer


Eureka!!!

The shuffle is happening because the join needs the data partitioned by the join key, while the plain repartition(200) distributes rows round-robin rather than hash-partitioning them by that key, so the join still needs its own exchange and sort. After changing from

accountToUserIdDataset = accountToUserIdDataset.repartition(200); // repartition so that we don't have to repartition on every shuffle

To

accountToUserIdDataset = accountToUserIdDataset.repartition(200, new Column("account_id")); // repartition by the join key so the cached data is hash-partitioned the way the join expects

The extra sort and shuffle went away. Below is the new DAG after the fix; stage 28 from above has been merged with stage 6.

[screenshot: new Spark UI DAG after the fix]
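
A quick way to confirm this (a sketch; otherDataset stands in for whichever dataset is joined against the cached view) is to look at the physical plan of one of the joins. With the data cached after repartition(200, new Column("account_id")), the cached side already satisfies the hash partitioning the join needs, so no extra Exchange should show up above the InMemoryTableScan:

    // "otherDataset" is hypothetical; it represents one of the datasets that gets
    // joined with the cached view in the real job.
    Dataset<Row> joined = otherDataset.join(accountToUserIdDataset, "account_id");

    // The cached side should appear as InMemoryTableScan with no
    // Exchange hashpartitioning(account_id, 200) on top of it.
    joined.explain();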

best wishes