
I have two PySpark DataFrames and I want to check whether the values of one column exist in a column of the other DataFrame.

I have only seen solutions that filter out the values that do exist (like this); what I need is to return a column of true or false.

In pandas it would be something like this:

df_A["column1"].isin(df_B["column1"])

thanks in advance!


3 Answers


EDIT: a cleaner way to do this:

import pyspark.sql.functions as F

# Left join against df2 tagged with a constant True flag;
# df1 rows without a match get null, which fillna turns into False
result = df1.join(df2.withColumn('flag', F.lit(True)), 'col1', 'left').fillna(False)

result.show()
+----+-----+
|col1| flag|
+----+-----+
|   0| true|
|   1| true|
|   2|false|
+----+-----+
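One caveat: if df2 can contain duplicate values in col1, the left join will multiply the matching rows of df1. A minimal sketch, assuming you only need the flag from df2, that deduplicates the key column first:

import pyspark.sql.functions as F

# Deduplicate df2's key column before joining, so each df1 row appears exactly once
result = (
    df1.join(
        df2.select('col1').distinct().withColumn('flag', F.lit(True)),
        'col1', 'left'
    )
    .fillna(False, 'flag')
)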

Old answer:

df1 = spark.createDataFrame(([0],[1],[2])).toDF('col1')
df2 = spark.createDataFrame(([0],[1],[3])).toDF('col1')

df1.join(df2, 'col1', 'semi') \
   .distinct() \
   .select('col1', F.lit(True).alias('flag')) \
   .join(df1, 'col1', 'right') \
   .fillna(False, 'flag') \
   .show()

+----+-----+
|col1| flag|
+----+-----+
|   0| true|
|   1| true|
|   2|false|
+----+-----+
  • This edited response works - the two DataFrames have to have a column with the same name (in mck's example, the column is 'col1'), and it works as expected. – John Youngtree May 13 '21 at 16:16
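If the key columns are named differently in the two DataFrames, a rename before the join keeps the same pattern working. A minimal sketch, assuming a hypothetical column colB in df2:

import pyspark.sql.functions as F

# Rename df2's key column to match df1, then apply the same left-join pattern
result = (
    df1.join(
        df2.withColumnRenamed('colB', 'col1').withColumn('flag', F.lit(True)),
        'col1', 'left'
    )
    .fillna(False, 'flag')
)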

You can collect all the values in column1, make a broadcast variable from them, and then write a UDF against it:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

# Collect the distinct values of df_B.column1 and broadcast them as a set
df_B_col_1_values = df_B.rdd.map(lambda x: x.column1).distinct().collect()
df_B_col_1_values = sc.broadcast(set(df_B_col_1_values))

# The UDF flags whether each value is present in the broadcast set
my_udf = udf(lambda x: x in df_B_col_1_values.value, BooleanType())
df_A = df_A.withColumn('column1_present', my_udf(col('column1')))
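If the set of values is small, a UDF may not be needed at all: Column.isin accepts a plain Python list, so the same flag can be built into the query plan directly. A minimal sketch of that alternative:

from pyspark.sql.functions import col

# Membership test expressed in the plan itself, no Python UDF involved
values = [row.column1 for row in df_B.select('column1').distinct().collect()]
df_A = df_A.withColumn('column1_present', col('column1').isin(values))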

To tweak mck's answer a little bit (drop duplicate df_A entries and select the relevant columns):

SHORT

df_A = df_A.withColumn("uid_A", F.monotonically_increasing_id())  # Add an index column
df_B = df_B.withColumn("uid_B", F.monotonically_increasing_id())  # Add an index column

df_A = df_A.alias('df_A')
(
    df_A
    .join(df_B.withColumn('flag', F.lit(True)), 'col1', 'left')
    .fillna(False)
    .dropDuplicates(['uid_A'])
    .select('df_A.col1', 'flag')
).show()

'''
+----+-----+
|col1| flag|
+----+-----+
|   A|false|
|   B|false|
|   C| true|
|   B|false|
|   C| true|
|   D| true|
+----+-----+
'''

FULL


### Initialise dataframes ###

import pandas as pd
import pyspark.sql.functions as F

df_A = pd.DataFrame({'col1': ['A', 'B', 'C', 'B', 'C', 'D'],
                     'col2': [1, 2, 3, 4, 5, 6]})
df_A = spark.createDataFrame(df_A)

df_B = pd.DataFrame({'col1': ['C', 'E', 'D', 'C', 'F', 'G', 'H'],
                     'col2': [10, 20, 30, 40, 50, 60, 70]})
df_B = spark.createDataFrame(df_B)

### PREVIOUS ###

df_A.join(df_B.withColumn('flag', F.lit(True)), 'col1', 'left').fillna(False).show()

'''
+----+----+----+-----+
|col1|col2|col2| flag|
+----+----+----+-----+
|   A|   1|null|false|
|   B|   2|null|false|
|   B|   4|null|false|
|   C|   3|  10| true|
|   C|   3|  40| true|
|   C|   5|  10| true|
|   C|   5|  40| true|
|   D|   6|  30| true|
+----+----+----+-----+
'''

### BETTER ###

df_A = df_A.withColumn("uid_A", F.monotonically_increasing_id())  # Add an index column
df_B = df_B.withColumn("uid_B", F.monotonically_increasing_id())  # Add an index column

df_A = df_A.alias('df_A')
(
    df_A
    .join(df_B.withColumn('flag', F.lit(True)), 'col1', 'left')
    .fillna(False)
    .dropDuplicates(['uid_A'])
    .select('df_A.col1', 'flag')
).show()

'''
+----+-----+
|col1| flag|
+----+-----+
|   A|false|
|   B|false|
|   C| true|
|   B|false|
|   C| true|
|   D| true|
+----+-----+
'''

Extra nuggets: to keep only the column values whose .isin result is True, it may be more straightforward to use PySpark's leftsemi join, which returns only the left table's rows that have a match on the specified columns of the right table, as shown also in this stackoverflow post.

# Pandas (isin)
df_A["column1"].isin(df_B["column1"])

# Pyspark (isin equivalent, applied to column values)
df_A.join(df_B, ["col1"], "leftsemi").show()

'''
+----+----+
|col1|col2|
+----+----+
|   C|   3|
|   C|   5|
|   D|   6|
+----+----+
'''

# Pandas (~isin)
~df_A["column1"].isin(df_B["column1"])

# Pyspark (~isin equivalent, applied to column values)
df_A.join(df_B, ["col1"], "leftanti").show()


Further join examples/info - PySpark Join Types | Join Two DataFrames
