
I have two large DataFrames, edge and vertex. I know GraphX needs them as its specialized vertex and edge RDD types, but every tutorial I have found builds those RDDs from hand-written arrays of 3 to 10 items. I need to convert directly from a large existing RDD. How would I change a DataFrame or normal RDD into the correct type?

I have followed the example here: https://spark.apache.org/docs/latest/graphx-programming-guide.html#example-property-graph but it enumerates every relationship by hand, and my use case has far too many for that.

  • edge df has 3 columns, (sourceID, destID, relationship)

  • vertex df has 2 columns (ID, Name)

What I have tried so far:

val vertex: RDD[(VertexId, String)] = sc.parallelize((vertexDF("ID"), vertexDF("Name")))

Returns Error:

error: type mismatch;
 found   : (org.apache.spark.sql.Column, org.apache.spark.sql.Column)
 required: Seq[(org.apache.spark.graphx.VertexId, String)]
    (which expands to)  Seq[(Long, String)]

How would I change a dataframe/normal RDD into the specialized vertex/edge RDD types?

Shaido
Joe S

1 Answer


The GraphFrames library for Spark handles DataFrame-based graphs. Its GraphFrame class has a toGraphX method that converts a pair of vertex and edge DataFrames into a GraphX graph. See: http://graphframes.github.io/graphframes/docs/_site/user-guide.html#example-conversions.

For your example it will look like:

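import org.apache.spark.graphx.Graph
import org.apache.spark.sql.Row
import org.graphframes._
// Outside spark-shell, the $"..." syntax also needs: import spark.implicits._

val edgeDf = .... // (sourceID, destID, relationship)
val vertexDf = .... // (ID, Name)
// GraphFrame requires a vertex column named "id" and edge columns named "src" and "dst"
val g = GraphFrame(
  vertexDf.select($"ID" as "id", $"Name" as "name"),
  edgeDf.select($"sourceID" as "src", $"destID" as "dst", $"relationship"))
// Convert to GraphX
val gx: Graph[Row, Row] = g.toGraphX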
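
If adding the GraphFrames dependency is not an option, the same conversion can probably be done with plain GraphX by mapping each Row of the DataFrames into the tuple and Edge shapes that GraphX expects. The following is only a minimal sketch: the local SparkSession and the toy data are assumptions standing in for your real DataFrames, and it assumes the ID columns are already of type Long (cast them with .cast("long") first if they are not).

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session and toy data stand in for the real DataFrames.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("df-to-graphx")
  .getOrCreate()
import spark.implicits._

val vertexDF = Seq((1L, "alice"), (2L, "bob"), (3L, "carol"))
  .toDF("ID", "Name")
val edgeDF = Seq((1L, 2L, "follows"), (2L, 3L, "likes"))
  .toDF("sourceID", "destID", "relationship")

// Vertices: one (VertexId, attribute) pair per row of the vertex DataFrame.
val vertices: RDD[(VertexId, String)] =
  vertexDF.rdd.map(row => (row.getAs[Long]("ID"), row.getAs[String]("Name")))

// Edges: one Edge(srcId, dstId, attribute) per row of the edge DataFrame.
val edges: RDD[Edge[String]] =
  edgeDF.rdd.map { row =>
    Edge(
      row.getAs[Long]("sourceID"),
      row.getAs[Long]("destID"),
      row.getAs[String]("relationship"))
  }

// Build the property graph from the two RDDs.
val graph: Graph[String, String] = Graph(vertices, edges)
```

Unlike toGraphX, which gives a Graph[Row, Row], this keeps the vertex and edge attributes as plain Strings. The original attempt failed because sc.parallelize expects a local Seq of values, while vertexDF("ID") is only a Column reference; going through vertexDF.rdd works on the distributed data directly.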
Shaido
Artem Aliev