I am writing application using Spark dataset API on databricks notebook.
I have 2 tables. One is 1.5billion rows and second 2.5 million. Both tables contain telecommunication data and join is done using country code and first 5 digits of a number. Output has 55 billion rows. Problem is I have skewed data(long running tasks). No matter how i repartition dataset I get long running tasks because of uneven distribution of hashed keys.
I tried using broadcast joins, tried persisting big table partitions in memory etc.....
What are my options here?