
I have a list minhash_sig = ['112', '223'], and I would like to find the Jaccard similarity between this list and every element in a PySpark dataframe's column. Unfortunately, I haven't been able to do so.

I've tried using array_intersect as well as array_union to do the comparison, but this does not work: I get the error Resolved attribute(s) missing.

Here are the PySpark dataframes that I have created so far.

from pyspark.sql import Row

df = spark.createDataFrame(
    [
        (1, ['112', '333']),
        (2, ['112', '223'])
    ],
    ["id", "minhash"]  # column names
)
minhash_sig = ['112', '223']
df2 = spark.createDataFrame([Row(c1=minhash_sig)])

And here is the code that I've used to try to compare the list to the PySpark column's elements.

from pyspark.sql.functions import size, array_intersect

df.withColumn('minhash_sim', size(array_intersect(df2.c1, df.minhash)))

Does anyone know how I can do this comparison without this error?

coderboi

1 Answer


The column from df2 is not known to df unless you join them into a single dataframe. You can crossJoin the two first and then apply your code:

df.crossJoin(df2).withColumn('minhash_sim',size(array_intersect("c1", "minhash")))\
  .show()

+---+----------+----------+-----------+
| id|   minhash|        c1|minhash_sim|
+---+----------+----------+-----------+
|  1|[112, 333]|[112, 223]|          1|
|  2|[112, 223]|[112, 223]|          2|
+---+----------+----------+-----------+
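
Since the original goal is the Jaccard similarity rather than just the intersection size, the same crossJoin can be extended with array_union to divide the intersection size by the union size. A minimal sketch, assuming Spark 2.4+ (where array_intersect and array_union are available):

from pyspark.sql.functions import array_intersect, array_union, size

# Jaccard similarity = |intersection| / |union|
df.crossJoin(df2)\
  .withColumn('jaccard',
              size(array_intersect("c1", "minhash")) /
              size(array_union("c1", "minhash")))\
  .show()

For id 1 this gives 1/3 (intersection {112}, union {112, 333, 223}), and for id 2 it gives 2/2 = 1.0. If df2 only exists to hold the constant list, the crossJoin could also be avoided by building the literal array directly, e.g. with array(*[lit(x) for x in minhash_sig]).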
anky