I’m trying to optimise a job which currently takes 2 hours to run. It uses RDDs, and converting them into DataFrames made it even worse. Here is what I did:
I have a very big fact table where a column is an array of structs. I can only join with the dimension table when that field is exploded.
Once I join with the dimension table, I need to aggregate on a certain dimension field obtained from the dim table, collect all the structs associated with that field, and convert them back into an array of structs.
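Logically, the shape I need is roughly this (a minimal sketch with a plain join; `fact`, `dim`, `events`, `event.dim_key` and `dim_field` are placeholder names, not my real schema):

```scala
import org.apache.spark.sql.functions._

// fact: (id, events: array<struct<dim_key, ...>>), dim: (dim_key, dim_field)
val result = fact
  .withColumn("event", explode(col("events")))           // one row per struct
  .join(dim, col("event.dim_key") === dim("dim_key"))     // join on the exploded field
  .groupBy("dim_field")
  .agg(collect_list("event").as("events"))                // back to array<struct>
```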
Also, I cannot broadcast the dim table because it is above 10 GB and the broadcast limit is 8 GB. So instead I built a HashMap and broadcast it with sc.broadcast to avoid the shuffle.
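Roughly how I build and broadcast the map (same placeholder names; this assumes only the join key and the single dimension field I need are collected to the driver):

```scala
// Collect only the join key and the dimension field, then broadcast the map
val dimMap: Map[String, String] = dim
  .select("dim_key", "dim_field")
  .collect()
  .map(r => r.getString(0) -> r.getString(1))
  .toMap

val dimMapBc = sc.broadcast(dimMap)
```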
Currently I’m using explode, HashMap lookups and collect_list, but all the executors are getting killed.
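This is approximately the pipeline that is running now, with the lookup done in a UDF against the broadcast map (placeholder names again):

```scala
import org.apache.spark.sql.functions._

// Look up the dimension field in the broadcast map instead of joining
val lookupDim = udf((key: String) => dimMapBc.value.getOrElse(key, null))

val result = fact
  .withColumn("event", explode(col("events")))               // one row per struct
  .withColumn("dim_field", lookupDim(col("event.dim_key")))  // broadcast map lookup
  .groupBy("dim_field")
  .agg(collect_list("event").as("events"))                   // back to array<struct>
```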
I feel explode and collect_list are also expensive operations.
So, I’d greatly appreciate any suggestions on how to optimise this, or any alternative approach, that would bring the runtime under an hour.
To recap: to avoid the shuffle I created a HashMap and perform the lookups in a UDF, but it would be helpful to know if there is any alternative to explode and collect_list, since they are very expensive.