
I have a dataframe with two columns, vertex and weight:

----------------
vertex | weight
----------------
a      | w1
b      | w2
..     | ...
x      | wx
----------------

I'm looking to compute a similarity between every pair of vertices. In other words, I'm looking for a new dataframe:

    ----------------------------
    vertex1 | vertex2 | weight
    ----------------------------
    a       | b       | w1+w2
    a       | c       | w1+w3
    ..      | ...     | ...
    a       | x       | w1+wx
    b       | a       | w2+w1
    b       | c       | w2+w3
    ....
    ----------------------------

Any suggestions on how to do that?

OmG
moudi
  • Possible duplicate of [New Dataframe column as a generic function of other rows (spark)](https://stackoverflow.com/questions/48174484/new-dataframe-column-as-a-generic-function-of-other-rows-spark) – pault May 22 '19 at 16:43

1 Answer


A simple solution is to join the dataframe with itself, with the constraint that the two vertices differ. A naive implementation could look like the following:

from pyspark.sql.functions import col

# Alias the columns so the two sides of the self-join don't collide.
df1 = df.select(col("vertex").alias("vertex1"), col("weight").alias("weight1"))
df2 = df.select(col("vertex").alias("vertex2"), col("weight").alias("weight2"))

# Join every row with every other row whose vertex differs,
# then sum the two weights.
result = df1.join(df2, col("vertex1") != col("vertex2")) \
            .withColumn("weight", col("weight1") + col("weight2")) \
            .select("vertex1", "vertex2", "weight")
OmG