
I have noticed that 0 appears to join with null when using a null-safe join (eqNullSafe).

df1 = spark.createDataFrame([(1, ), (None, )], ['df1_id'])
df2 = spark.createDataFrame([(None, ), (0, )], ['df2_id'])

df1.join(df2, df1.df1_id.eqNullSafe(df2.df2_id), 'right').show()
#+------+------+
#|df1_id|df2_id|
#+------+------+
#|  null|     0|
#|  null|  null|
#+------+------+

df2.join(df1, df1.df1_id.eqNullSafe(df2.df2_id), 'left').show()
#+------+------+
#|df2_id|df1_id|
#+------+------+
#|     0|  null|
#|  null|  null|
#+------+------+

How do I make null join only with null?

ZygD

1 Answer


You need to do an inner join here:

df1.join(df2, df1.df1_id.eqNullSafe(df2.df2_id), 'inner').show()

Right now, the 0 in the right DataFrame has no match in the left DataFrame. Because you are doing a right join, PySpark keeps that unmatched row from the right DataFrame anyway and fills df1_id with null. That null is outer-join padding, not a match produced by eqNullSafe. The same thing happens with your left join, just with the DataFrames swapped. An inner join drops the unmatched rows, so null only pairs with null.
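
For reference, with the same two DataFrames from the question, the inner null-safe join keeps only the null-to-null match; the output below is a sketch of what show() should print:

df1.join(df2, df1.df1_id.eqNullSafe(df2.df2_id), 'inner').show()
#+------+------+
#|df1_id|df2_id|
#+------+------+
#|  null|  null|
#+------+------+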

Rahul Kumar