-1

I have two dataframes in Spark, both with an IP column. One column has over 800000 entries while the other has 4000 entries. What I want to do is to see if the IP's in the smaller dataframe appear in the IP column of the large dataframe.

At the moment all I can manage is to compare the first row of the columns, second rows of the columns etc.

Thanks in advance!

thebluephantom
  • 16,458
  • 8
  • 40
  • 83
  • I think you are describing an [inner join](https://stackoverflow.com/questions/40343625/joining-spark-dataframes-on-the-key/40356825) – abiratsis Feb 27 '20 at 21:32

1 Answers1

0

How about something like this, straight from the manuals using EXISTS or NOT EXISTS, having created tempviews, see here https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html:

%sql
SELECT  *
FROM    t1 A
WHERE   NOT EXISTS (SELECT  1
                      FROM  t2 B
                     WHERE  B.colx = A.colx)
thebluephantom
  • 16,458
  • 8
  • 40
  • 83