
I have a Spark job that inner joins a large Hive table (5bn rows; 1000 partitions of ~400MB each, compressed Parquet) with a much smaller table that is likely to contain fewer than a few hundred rows and on some/most weeks may be empty.

The data in the large table is not partitioned/bucketed by the join key, and in any case the join key is very heavily skewed, such that attempting a non-broadcast join causes some executors to exceed memory limits.

Luckily the smaller table's size will always be way below the broadcast threshold, so by using broadcast(rhs) I can avoid shuffling the large Dataset by the skewed key.

Now when the RHS is empty, Spark still seems to do a fair amount of work, even though it seems fairly obvious the result will be an empty Dataset.

I can only assume Spark does not check for empty Datasets before (inner) joining because the check may itself be expensive, but I would appreciate a definitive answer.

In my case I know the RHS will be small, so invoking rhs.rdd.count will be cheap and I can skip the join when it is unnecessary.
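As an aside, here is a minimal sketch of the emptiness check I have in mind (`rhs` is a placeholder for the small Dataset; this assumes an active SparkSession and is not business code):

```scala
// Sketch only. rhs.rdd.count converts to an RDD and counts every row;
// these alternatives short-circuit after fetching at most one row.
val rhsIsEmpty: Boolean = rhs.head(1).isEmpty

// On Spark 2.4+ there is also a built-in:
// val rhsIsEmpty: Boolean = rhs.isEmpty
```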

I have had to omit business sensitive code but the basic algorithm is:

    // Note small and large tables are cached for later re-use

    smallTable
      // Complex DAG
      // write to hive
      .cache

    largeTable
      // read from hive
      .cache

    largeTable.as("l")
      .join(broadcast(smallTable.as("r")), $"l.key" === $"r.key", "inner")
      .select($"l.*")
      .as[LargeTable]
      .mapPartitions(mapPartitionsFunction)
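In other words, I could guard the join roughly like this (a sketch, not the real code; `R` stands for the output type of mapPartitionsFunction and is assumed to have an Encoder in scope):

```scala
// Sketch: skip the broadcast join entirely when the RHS is empty.
// Assumes smallTable is already cached, so the check is cheap.
val result =
  if (smallTable.head(1).isEmpty) {
    // An empty inner join yields an empty Dataset of the output type.
    spark.emptyDataset[R]
  } else {
    largeTable.as("l")
      .join(broadcast(smallTable.as("r")), $"l.key" === $"r.key", "inner")
      .select($"l.*")
      .as[LargeTable]
      .mapPartitions(mapPartitionsFunction)
  }
```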

Thanks for any insight.

Terry.

Terry Dactyl
  • Can you add the code in your question? – Emiliano Martinez Oct 21 '19 at 08:07
  • Note that there are some better ways to check if a dataframe is empty (especially converting to an rdd can be slow): https://stackoverflow.com/questions/32707620/how-to-check-if-spark-dataframe-is-empty – Shaido Oct 21 '19 at 08:36

0 Answers