
I am using graphframes in pyspark for some graph-type analytics and am wondering what the best way would be to create an edge list data frame from a vertices data frame.

For example, below is my vertices data frame. I have a list of ids and they belong to different groups.

+---+-----+
|id |group|
+---+-----+
|a  |1    |
|b  |2    |
|c  |1    |
|d  |2    |
|e  |3    |
|a  |3    |
|f  |1    |
+---+-----+
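
For reproducibility, here is a minimal sketch of how this sample data frame (called `df` below) could be built, assuming a SparkSession named `spark` is already available:

from pyspark.sql import SparkSession

# Sketch only: build the sample vertices data frame shown above
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('a', 1), ('b', 2), ('c', 1), ('d', 2), ('e', 3), ('a', 3), ('f', 1)],
    ['id', 'group'])
df.show()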

My objective is to create an edge list data frame indicating which ids appear in common groups. Please note that one id can appear in multiple groups (e.g. id a above is in groups 1 and 3). Below is the edge list data frame that I'd like to get:

+---+-----+-----+
|src|dst  |group|
+---+-----+-----+
|a  |c    |1    |
|a  |f    |1    |
|c  |f    |1    |
|b  |d    |2    |
|a  |e    |3    |
+---+-----+-----+

Thanks in advance!

MAMS
  • what if you add one more row `(id='f', group=1)`, how do we know which id is `src` and which id is `dst`? are there any other columns to sort ids for each group? – jxc Dec 29 '20 at 03:55
  • @jxc This is a good point. Please see above for new examples including id = 'f' and group = 1. src and dst order does not have to be fixed in my case. As long as 2 ids in the same group can be shown in the same row, it would satisfy the needs. – MAMS Dec 29 '20 at 15:19
  • @jxc I am using spark 2.3 – MAMS Dec 29 '20 at 15:59
  • 2
  • just do a self-join: `df.alias('d1').join(df.alias('d2'), ['group']).filter("d1.id < d2.id").toDF("group", "src", "dst")` – jxc Dec 29 '20 at 16:02
  • @jxc I think you should post this as an answer. It is more straightforward than the other two answers. Your solution is only missing `distinct()` at the end (if we have, for example, two instances of (1, a), it will give us duplicate rows); see the sketch just below this thread. – pegah Feb 08 '21 at 14:20
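
For completeness, a runnable sketch of the self-join suggested in the comments above, including the `distinct()` guard pegah mentions (the data frame is assumed to be named `df`, with columns `id` and `group` as in the question):

edges = (df.alias('d1')
           .join(df.alias('d2'), ['group'])   # pair up ids that share a group
           .filter('d1.id < d2.id')           # keep one direction per pair, drop self-loops
           .toDF('group', 'src', 'dst')       # rename the three resulting columns
           .distinct())                       # guard against duplicate (id, group) rows
edges.show()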

2 Answers


Edit 1

Not sure if it's the best way to solve it, but here is a workaround:

import pyspark.sql.functions as f
from pyspark.sql import Window

# Collect every id that shares a group into an array column
df = df.withColumn('match', f.collect_set('id').over(Window.partitionBy('group')))

# Explode the array so each id is paired with every id in its group
df = df.select(f.col('id').alias('src'),
               f.explode('match').alias('dst'),
               f.col('group'))

# Sort each (src, dst) pair so that (a, c) and (c, a) collapse into one edge
df = df.withColumn('duplicate_edges', f.array_sort(f.array('src', 'dst')))
df = (df
      .where(f.col('src') != f.col('dst'))          # drop self-loops
      .drop_duplicates(subset=['duplicate_edges'])  # keep one direction per pair
      .drop('duplicate_edges'))

df.sort('group', 'src', 'dst').show()

Output

+---+---+-----+
|src|dst|group|
+---+---+-----+
|  a|  c|    1|
|  a|  f|    1|
|  c|  f|    1|
|  b|  d|    2|
|  e|  a|    3|
+---+---+-----+

Original answer

Try this:

import pyspark.sql.functions as f

# Pair the first and last id within each group as src/dst
df = (df
      .groupby('group')
      .agg(f.first('id').alias('src'),
           f.last('id').alias('dst')))

df.show()

Output:

+-----+---+---+
|group|src|dst|
+-----+---+---+
|    1|  a|  c|
|    3|  e|  a|
|    2|  b|  d|
+-----+---+---+
Kafels
  • 3
  • What @Kafels proposes is absolutely right. However, do not forget to include the following at the start of your code: `import pyspark.sql.functions as f` – Marioanzas Dec 29 '20 at 09:58
  • Thank you both for the answer; this is a great approach! The only thing missing is that when I have more than 2 ids in the same group, only the first and last ids show up as src and dst, and the others are missed. For example, as @jxc mentioned in the comment, if we have another record with id = 'f' and group = 1, I'd expect a, c, f in group 1 to appear in the result data frame. The src & dst order doesn't really matter. I have updated my example in the question; would you be able to think of a way to handle it? Thanks! – MAMS Dec 29 '20 at 15:27

You can do a self join:

df = df.toDF('src', 'group')     # rename id -> src
df2 = df.toDF('dst', 'group2')   # second copy with columns renamed to dst, group2

result = df.join(
    df2,
    (df.group == df2.group2) & (df.src < df2.dst)  # same group, keep one direction per pair
).select('src', 'dst', 'group').distinct().orderBy('group', 'src', 'dst')

result.show()
+---+---+-----+
|src|dst|group|
+---+---+-----+
|  a|  c|    1|
|  a|  f|    1|
|  c|  f|    1|
|  b|  d|    2|
|  a|  e|    3|
+---+---+-----+
mck
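
Since the question mentions graphframes, here is a hedged sketch of feeding such an edge list into a GraphFrame; `vertices_df` is an assumed name for the original id/group frame, and GraphFrame expects an `id` column on the vertices and `src`/`dst` columns on the edges:

from graphframes import GraphFrame

# Sketch only: `vertices_df` stands for the original id/group frame from the
# question and `result` for the edge list built above
vertices = vertices_df.select('id').distinct()  # GraphFrame needs unique vertex ids
g = GraphFrame(vertices, result)
g.degrees.show()                                # e.g. inspect vertex degrees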