
I’m trying to optimise a job that currently runs for 2 hours. It uses RDDs, and converting them into DataFrames makes it even worse. Here is what I did:

I have a very big fact table where one column is an array of structs. I can only join with the dimension table after that field is exploded.

Once I join with the dimension table, I need to aggregate on a certain dimension field obtained from the dim table, collect all the structs associated with that field, and convert them back into an array of structs. Roughly, the pipeline looks like the sketch below (column and table names are simplified placeholders, not my real schema):
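```scala
import org.apache.spark.sql.functions._

// Illustrative schema: fact(fact_id, events: array<struct<dim_key, ...>>),
// dim(dim_key, dim_field). All names are placeholders.
val exploded = fact
  .select(col("fact_id"), explode(col("events")).as("event"))

// Enrich each exploded struct with the dimension attribute, then regroup:
// aggregate by the dimension field and rebuild the array of structs.
val result = exploded
  .join(dim, col("event.dim_key") === dim("dim_key"))
  .groupBy(col("dim_field"))
  .agg(collect_list(col("event")).as("events"))
```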

Also, I cannot broadcast the dim table because it’s above 10 GB and the limit is 8 GB. So instead, I built a hash map and used sc.broadcast to avoid a shuffle. A rough sketch of how I build and broadcast that map (placeholder names again, assuming only the key and the single attribute I need are kept so it fits in memory):
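```scala
import spark.implicits._   // needed for .as[(Long, String)]

// Collect the dim table (key -> attribute) to the driver as a plain Scala map
// and broadcast it, so the lookup happens locally on each executor
// instead of going through a shuffle join.
val dimMap: Map[Long, String] = dim
  .select("dim_key", "dim_field")
  .as[(Long, String)]
  .collect()
  .toMap

val dimMapBc = spark.sparkContext.broadcast(dimMap)
```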

Currently I’m using explode, hash map lookups and collect_list, but all the executors are getting killed.

I feel explode and collect_list are also expensive operations.

So, I’d greatly appreciate it if you could suggest any ways to optimise it, or any alternatives, so that it runs in less than an hour.

To avoid a shuffle, I created a hash map and performed lookups using a UDF, roughly as in the sketch below. However, it would be helpful to know if there is any alternative to explode and collect_list, since they are very expensive.
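```scala
import org.apache.spark.sql.functions._

// UDF that resolves the dimension attribute from the broadcast map built above;
// missing keys fall back to null. Names are placeholders.
val lookupDim = udf((key: Long) => dimMapBc.value.get(key).orNull)

val result = exploded
  .withColumn("dim_field", lookupDim(col("event.dim_key")))
  .groupBy(col("dim_field"))
  .agg(collect_list(col("event")).as("events"))
```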

  • Probably not the answer you are looking for but if your fact table was designed properly you wouldn’t have this issue. Your fact table should contain FKs to your dimensions’ PKs and these should be numeric surrogate keys – NickW Jun 28 '23 at 18:53

0 Answers