
I am looking for a fast PySpark alternative to

SELECT foo FROM bar
WHERE foo IN (SELECT baz FROM bar)

Collecting beforehand into a Python list is absolutely not an option, as the dataframes handled are quite large and collect takes up a tremendous amount of time relative to the other option I came up with. So I can't think of a way to use the native PySparkian where(col('foo').isin(baz)), since baz would have to be a Python list in this case.
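
For illustration, the collect-based version being ruled out would look something like this (a minimal sketch; bar, foo and baz are the placeholder names from the query above):

from pyspark.sql.functions import col

# Pulls every baz value onto the driver - prohibitively slow for large dataframes
baz_values = [row.baz for row in bar.select('baz').distinct().collect()]
bar_where_foo_is_in_baz = bar.where(col('foo').isin(baz_values))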

One option I came up with is a right JOIN as a replacement for IN and a left_anti JOIN as a replacement for NOT IN; consider the following example:

bar_where_foo_is_in_baz = bar.join(bar.select(col('baz').alias('baz_')), col('foo') == col('baz_'), 'right').drop('baz_')

This, however, is quite verbose, rather hard to interpret when read after a while, and results in a fair bit of head-scratching when a larger number of conditions is handled within the WHERE, so I'd like to avoid it.

Are there any other options?

EDIT (please read):

As I seem to have misled quite a number of answerers, my specific requirement is translating a "WHERE - IN" clause into PySpark without .collect() or, in general, without mapping to a Python list (as the built-in .isin() would require me to do).

LordBertson
  • PySpark has decent SQL support, so I believe you can do this using the PySpark SQL library. https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#running-sql-queries-programmatically should get you going. https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html for reference as well. – born_naked May 18 '20 at 06:21
  • Sounds good, I'll check it out, thanks. – LordBertson May 18 '20 at 06:24
  • Do you know that the "broadcast JOIN" works exactly like this, i.e. creating a hashmap of values in all executors? So it's a plain SQL query, with an extra hint if you want to make sure that Spark optimizer does not mess up. Do not reinvent the wheel. This class of problems has been around since the first MPP databases 25 years ago. – Samson Scharfrichter May 18 '20 at 09:23
  • In Spark I think your best bet is to use left_semi or left_anti (see the sketch after these comments). – murtihash May 18 '20 at 17:54
  • https://stackoverflow.com/questions/42545788/pyspark-match-the-values-of-a-dataframe-column-against-another-dataframe-column seems to have the answer you need. – born_naked May 18 '20 at 19:17
  • yep, that's what I am doing right now. I hoped an alternative would exist. – LordBertson May 18 '20 at 19:49
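
Putting the comment suggestions together, a minimal sketch (assuming a SparkSession named spark and the question's placeholder dataframe bar):

from pyspark.sql.functions import col

# born_naked's suggestion: register a temp view and run the original SQL as-is;
# Spark SQL (2.0+) supports IN subqueries without collecting to the driver
bar.createOrReplaceTempView('bar')
result_sql = spark.sql('SELECT foo FROM bar WHERE foo IN (SELECT baz FROM bar)')

# murtihash's suggestion: left_semi keeps only the bar rows whose foo matches some baz
result_semi = bar.join(bar.select('baz'), col('foo') == col('baz'), 'left_semi')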

1 Answer


From Jacek Laskowski's GitBook

Spark SQL uses broadcast join (aka broadcast hash join) instead of hash join to optimize join queries when the size of one side data is below spark.sql.autoBroadcastJoinThreshold

Broadcast join can be very efficient for joins between a large table (fact) with relatively small tables (dimensions) that could then be used to perform a star-schema join. It can avoid sending all data of the large table over the network.

And also

Spark SQL 2.2 supports BROADCAST hints using broadcast standard function or SQL comments:

  • SELECT /*+ MAPJOIN(b) */ …
  • SELECT /*+ BROADCASTJOIN(b) */ …
  • SELECT /*+ BROADCAST(b) */ …

Actually the Spark documentation is more comprehensive about the hints:

The BROADCAST hint guides Spark to broadcast each specified table when joining them with another table or view. When Spark decides the join method, the broadcast hash join (i.e., BHJ) is preferred, even if the statistics are above the configuration spark.sql.autoBroadcastJoinThreshold.

When both sides of a join are specified, Spark broadcasts the one having the lower statistics. Note Spark does not guarantee BHJ is always chosen, since not all cases (e.g. full outer join) support BHJ.
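
Applied to the question's setup, a minimal sketch of the DataFrame-side hint (bar, foo and baz are the question's placeholder names; broadcast lives in pyspark.sql.functions):

from pyspark.sql.functions import broadcast, col

# The distinct key set is small, so hint Spark to ship it to every executor;
# the left_semi join then behaves like WHERE foo IN (SELECT baz FROM bar)
keys = bar.select('baz').distinct()
result = bar.join(broadcast(keys), col('foo') == col('baz'), 'left_semi')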


Finally, if you are serious about developing efficient Spark jobs, you should spend some time understanding how that beast works.

For instance, that presentation about joins from Databricks should be helpful.

Samson Scharfrichter
  • I will be responding to the comment as well as the answer here. The thing is, I am absolutely aware of the broadcast join and its use, yet I am not quite able to discern how you suggest it be applied to the scenario at hand. I am not specifically interested in optimizing the JOIN I have used as an example, but rather, I'd like to somehow translate the "WHERE - IN" clause to Spark without the need to collect or, more generally, map to a list (as the .isin() function would require). Perhaps I should have been more clear in my requirement, I will edit the question. – LordBertson May 18 '20 at 10:50
  • So, can you explain to me the difference between your `SELECT a.* FROM a WHERE a.x IN (SELECT xx FROM b WHERE wtf)` and `SELECT a.* FROM a JOIN (SELECT xx FROM b WHERE wtf) bb ON a.x = bb.xx` and also `SELECT a.* FROM a WHERE EXISTS (SELECT 1 FROM b WHERE wtf AND b.xx = a.x)`? AFAIK a decent query compiler will translate all three to the same exact execution plan (a quick way to check is sketched below). – Samson Scharfrichter May 18 '20 at 11:41
  • I would assume there's none. But I have already stated in the question that a JOIN, specifically an INNER one, can be used as a "WHERE - IN" and that is my temporary solution, yet it's a bit unwieldy and I am hoping to avoid it. Should it prove that I cannot avoid it, optimizations on the join are very welcome. – LordBertson May 18 '20 at 11:51
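
A quick way to check the claim about identical plans (a sketch, assuming a SparkSession named spark and the question's bar dataframe):

bar.createOrReplaceTempView('bar')

# If the optimizer really normalizes these, the physical plans should match
spark.sql('SELECT foo FROM bar WHERE foo IN (SELECT baz FROM bar)').explain()
spark.sql('SELECT a.foo FROM bar a LEFT SEMI JOIN bar b ON a.foo = b.baz').explain()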