From this, "A GraphFrame can also be constructed from a single DataFrame containing edge information. The vertices will be inferred from the sources and destinations of the edges."

However, when I look at the API docs, I can't find a way to create one.

Has anyone created a GraphFrame from only an edge DataFrame? If so, how?

3 Answers

To avoid duplicates in the vertex list, I would add a distinct():

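from graphframes import GraphFrame

# Collect every id that appears as a source or a destination, then deduplicate.
# The union keeps the first column name ("src"), which is renamed to "id".
verticesDf = edgesDf \
    .select("src") \
    .union(edgesDf.select("dst")) \
    .distinct() \
    .withColumnRenamed("src", "id")

verticesDf.show()

graph = GraphFrame(verticesDf, edgesDf)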
Alex Ortner

The GraphFrames Scala API has a function called fromEdges that builds a GraphFrame from an edge DataFrame. As far as I can tell, this function isn't available in PySpark, but you can do something like:

##something

verticesDf = edgesDF.select('src').union(edgesDF.select('dst'))
verticesDf = verticesDf.withColumnRenamed('src', 'id')

##more something

to achieve the same.

cronoik

Here is another alternative that doesn't scan the edge data twice:

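from pyspark.sql import functions as F

nodes = (
    edges
    # Pack src and dst into one array, then emit one row per element.
    .withColumn("id", F.explode(F.array(F.col("src"), F.col("dst"))))
    .select("id")
    .distinct()
)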
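
To make the explode-then-distinct idea concrete without a Spark session, here is a minimal plain-Python sketch of the same logic on a list of (src, dst) tuples. The name infer_vertices is hypothetical and not part of GraphFrames or PySpark. Unlike Spark's distinct(), this version happens to preserve first-seen order.

```python
def infer_vertices(edges):
    """Return the distinct vertex ids that appear as src or dst, in first-seen order."""
    seen = set()
    vertices = []
    for src, dst in edges:
        # Each edge contributes two candidate ids, like exploding array(src, dst).
        for vertex_id in (src, dst):
            # Skip ids already collected, like distinct().
            if vertex_id not in seen:
                seen.add(vertex_id)
                vertices.append(vertex_id)
    return vertices


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
vertices = infer_vertices(edges)
print(vertices)
```

Each edge is visited once, which is the property the single-pass DataFrame version above is after.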
Mathias Longo