I have a Spark job that inner joins a large Hive table (5bn rows, 400MB x 1000 partitions, compressed parquet) with a much smaller table which is likely to contain less than a few hundred rows and on some/most weeks may be empty.
The data in the large table is not partitioned/bucketed by the join key and in any case the join key is very heavily skewed such that attempting a non-broadcast join causes some executors to exceed memory limits.
Luckily the smaller table size will always be way below the broadcast threshold so by using broadcast(rhs)
I can avoid shuffling the large Dataset by the skewed key.
Now when the RHS is empty Spark still seems to do a fair amount of work when it seems fairly obvious the result will be an empty Dataset.
I can only assume Spark does not check for empty Datasets before (inner) joining because the check may be expensive but would appreciate a definitive answer.
In my case I know the RHS will be small so invoking rhs.rdd.count will be cheap and I can skip the join if unnecessary.
I have had to omit business sensitive code but the basic algorithm is:
// Note small and large tables are cached for later re-use
smallTable
// Complex DAG
// write to hive
.cache
largeTable
// read from hive
.cache
largeTable.as("l")
.join(broadcast(smallTable.as("r")), $"l.key" === $"r.key", "inner")
.select($"l.*")
.as[LargeTable]
.mapPartitions(mapPartitionsFunction)
Thanks for any insight.
Terry.