You have a couple of options, depending on whether the distribution of your keys is actually correct.
The first thing you must verify is:
Is the distribution of keys actually correct? That is, are the duplicated rows per key actually valid, and do they need to be operated on?
It's quite common for null values or other invalid keys to be present in your data, and it's worth verifying whether these need to be filtered out or consolidated by picking just the latest version per key. The latter is commonly called a max-row or min-row operation: for each key, pick the row that has the maximum (or minimum) value of some other column, such as a timestamp column. A sketch of both approaches follows.
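As a minimal PySpark sketch of these checks, assuming a join key column named key and a timestamp column named updated_at (both illustrative names, not anything prescribed above):

```python
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.window import Window


def clean_keys(df: DataFrame) -> DataFrame:
    # Inspect the key distribution first: nulls or a handful of dominant keys
    # usually indicate invalid keys rather than genuinely skewed data.
    df.groupBy("key").count().orderBy(F.desc("count")).show(20)

    # Drop rows whose key is null, then keep only the latest row per key
    # (a "max row" operation), ranked by a timestamp column.
    w = Window.partitionBy("key").orderBy(F.desc("updated_at"))
    return (
        df.filter(F.col("key").isNotNull())
        .withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )
```

Whether you filter, deduplicate, or keep all rows depends entirely on what the duplicates mean in your data; the window approach above is only appropriate when one row per key is the desired result.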
Assuming the keys present are in fact valid and need to be operated on, you next must ask:
Is one side of the join significantly smaller than the other?
This typically means the right side of the join has roughly 1/10th the number of keys of the left side. If this is true, you can try Salting the Join. It's worth noting that the size difference is not a function of the total rows in each dataset (although that can be a quick-and-dirty estimate); it should instead be thought of as a difference in row counts per join key. You can get the counts per key using the technique here, and the scale difference can then be computed by dividing df1_COUNT by df2_COUNT instead of multiplying them, as sketched below.
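The following is a minimal PySpark sketch of this check, plus a basic version of the salting idea for reference. The column name key, the function names, and the salt bucket count of 16 are illustrative assumptions, not anything prescribed by this guidance; the full salting technique is documented separately.

```python
from pyspark.sql import DataFrame, functions as F


def key_scale_difference(df1: DataFrame, df2: DataFrame) -> DataFrame:
    # Count rows per join key on each side of the join.
    df1_counts = df1.groupBy("key").agg(F.count("*").alias("df1_COUNT"))
    df2_counts = df2.groupBy("key").agg(F.count("*").alias("df2_COUNT"))

    # Divide the per-key counts; a consistently large ratio means the right
    # side is much smaller per key, making the join a candidate for salting.
    return (
        df1_counts.join(df2_counts, on="key", how="inner")
        .withColumn("scale", F.col("df1_COUNT") / F.col("df2_COUNT"))
        .orderBy(F.desc("scale"))
    )


def salted_join(df1: DataFrame, df2: DataFrame, buckets: int = 16) -> DataFrame:
    # Scatter the large, skewed side across `buckets` random salt values...
    left = df1.withColumn("salt", (F.rand(seed=42) * buckets).cast("int"))
    # ...and replicate the small side once per salt value so rows still match.
    right = df2.withColumn(
        "salt", F.explode(F.array([F.lit(i) for i in range(buckets)]))
    )
    return left.join(right, on=["key", "salt"], how="inner").drop("salt")
```

The trade-off is that the small side is duplicated `buckets` times, which is why salting only pays off when one side really is much smaller per key than the other.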
If the right side of the join is not significantly smaller than the left, then:
You have a large join with similar row counts on both sides. You must boost Executor memory to allow the rows to fit into memory.
This means you must apply a profile to your Transform that increases the Executor memory above its current value (the current value can be found on the same page where AQE is noted), as sketched below.
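As a rough sketch of what that looks like for a Python Transform: the profile name EXECUTOR_MEMORY_LARGE, the dataset paths, and the join key below are placeholders, and the set of Spark profiles you are allowed to apply depends on what has been enabled for your environment.

```python
from transforms.api import configure, transform_df, Input, Output


# EXECUTOR_MEMORY_LARGE is a placeholder; use whichever executor memory
# profile is available to you and larger than the Transform's current setting.
@configure(profile=["EXECUTOR_MEMORY_LARGE"])
@transform_df(
    Output("/path/to/joined_output"),       # illustrative paths
    left=Input("/path/to/left_dataset"),
    right=Input("/path/to/right_dataset"),
)
def compute(left, right):
    return left.join(right, on="key", how="inner")  # "key" is illustrative
```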