I have a list, minhash_sig = ['112', '223'], and I would like to compute the Jaccard similarity between this list and every element of a PySpark DataFrame column. So far I have not been able to get this to work.
I've tried using array_intersect and array_union to do the comparison, but this fails with the message "Resolved attribute missing".
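Here is the PySpark DataFrame that I have created so far (spark is an existing SparkSession):
from pyspark.sql import Row
from pyspark.sql.functions import array_intersect, size

df = spark.createDataFrame(
    [
        (1, ['112', '333']),
        (2, ['112', '223']),
    ],
    ["id", "minhash"],  # column names
)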
minhash_sig = ['112', '223']
df2 = spark.createDataFrame([Row(c1=minhash_sig)])
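To make the target concrete, here is a minimal plain-Python sketch of the values I expect for these two rows, taking Jaccard similarity as |A ∩ B| / |A ∪ B| over the distinct elements. The helper name jaccard and the variable expected are only illustrative, and returning 0.0 for two empty lists is my own assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two lists, treated as sets of distinct items."""
    sa, sb = set(a), set(b)
    union = sa | sb
    # Assumption: two empty sets get similarity 0.0 to avoid dividing by zero.
    return len(sa & sb) / len(union) if union else 0.0


minhash_sig = ['112', '223']
rows = [(1, ['112', '333']), (2, ['112', '223'])]

# Map each row id to its similarity against minhash_sig.
expected = {row_id: jaccard(minhash_sig, minhash) for row_id, minhash in rows}
print(expected)
```

So id 1 should come out as 1/3 (one shared value out of three distinct values) and id 2 as 1.0.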
And here is the code I used to try to compare the list against the elements of the minhash column:
df.withColumn('minhash_sim', size(array_intersect(df2.c1, df.minhash)))
Does anyone know how I can do this comparison without this error?