Is there a way to compare all rows in one column of a dataframe against all rows in another column of another dataframe (spark)?

Question

I have two dataframes in Spark, both with an IP column. One column has over 800000 entries while the other has 4000 entries. What I want to do is to see if the IP's in the smaller dataframe appear in the IP column of the large dataframe.

At the moment all I can manage is to compare the first row of the columns, second rows of the columns etc.

Thanks in advance!

I think you are describing an [inner join](https://stackoverflow.com/questions/40343625/joining-spark-dataframes-on-the-key/40356825) — abiratsis, Feb 27 '20 at 21:32

score 0 · Accepted Answer · answered Feb 27 '20 at 17:11

How about something like this, straight from the manuals using EXISTS or NOT EXISTS, having created tempviews, see here https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html:

%sql
SELECT  *
FROM    t1 A
WHERE   NOT EXISTS (SELECT  1
                      FROM  t2 B
                     WHERE  B.colx = A.colx)

Is there a way to compare all rows in one column of a dataframe against all rows in another column of another dataframe (spark)?

1 Answers1