
I have a dataframe with two columns, vertex and weight:

----------------
vertex | weight
----------------
a      | w1
b      | w2
..     | ...
x      | wx
----------------

I'm looking to compute a similarity between every pair of vertices. In other words, I'm looking for a new dataframe:

    ----------------------------
    vertex1 | vertex2 | weight
    ----------------------------
    a       | b       | w1+w2
    a       | c       | w1+w3
    ..      | ...     | ...
    a       | x       | w1+wx
    b       | a       | w2+w1
    b       | c       | w2+w3
    ....
    ----------------------------

Any suggestions on how to do that?

OmG
moudi
  • Possible duplicate of [New Dataframe column as a generic function of other rows (spark)](https://stackoverflow.com/questions/48174484/new-dataframe-column-as-a-generic-function-of-other-rows-spark) – pault May 22 '19 at 16:43

1 Answer


A simple solution is to join the dataframe with itself, with the constraint that the two vertices differ. A naive implementation could look like the following:

from pyspark.sql.functions import col

# Alias the columns so the two sides of the self-join don't collide.
df1 = df.select(col("vertex").alias("vertex1"), col("weight").alias("weight1"))
df2 = df.select(col("vertex").alias("vertex2"), col("weight").alias("weight2"))

# Join every row with every other row whose vertex differs,
# then sum the two weights.
result = df1.join(df2, col("vertex1") != col("vertex2")) \
            .withColumn("weight", col("weight1") + col("weight2")) \
            .select("vertex1", "vertex2", "weight")
OmG