I have two large DataFrames, edge and vertex, and I know that they need to be converted into the special Vertex and Edge RDD types, but every tutorial that I have found builds the Edge and Vertex RDDs from hand-written arrays of 3 to 10 items. I need to convert them directly from a substantial RDD. How would I change a DataFrame/normal RDD into the correct type?
I have followed the example here: https://spark.apache.org/docs/latest/graphx-programming-guide.html#example-property-graph but it enumerates every relationship by hand, and there are far too many in my use case to do that.
The edge df has 3 columns: (sourceID, destID, relationship).
The vertex df has 2 columns: (ID, Name).
What I have tried so far:
val vertex: RDD[(VertexId, String)] = sc.parallelize((vertexDF("ID"), vertexDF("Name")))
This returns the following error:
error: type mismatch;
found : (org.apache.spark.sql.Column, org.apache.spark.sql.Column)
required: Seq[(org.apache.spark.graphx.VertexId, String)]
(which expands to) Seq[(Long, String)]
How would I change a dataframe/normal RDD into the specialized vertex/edge RDD types?
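For reference, here is a minimal sketch of the kind of conversion I am after. It is not a confirmed solution. It assumes the ID columns are numeric (so they can be cast to Long, which is what VertexId expands to) and that Name and relationship are strings. The two tiny DataFrames, the object name DataFrameToGraph, and the local SparkSession are hypothetical stand-ins for my real data and setup. The idea is to go through the DataFrame's underlying RDD of Rows and map each Row to the tuple or Edge that GraphX expects, instead of passing Column objects to sc.parallelize.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DataFrameToGraph {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption); a real job would use its existing session.
    val spark = SparkSession.builder()
      .appName("df-to-graphx")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the large DataFrames; column names follow the question.
    val vertexDF = Seq((1L, "Alice"), (2L, "Bob")).toDF("ID", "Name")
    val edgeDF = Seq((1L, 2L, "friend")).toDF("sourceID", "destID", "relationship")

    // Cast the ID to Long, then map each Row to the (VertexId, attribute) pair GraphX expects.
    // This stays distributed: nothing is collected to the driver.
    val vertices: RDD[(VertexId, String)] =
      vertexDF
        .select(col("ID").cast("long"), col("Name"))
        .rdd
        .map(row => (row.getLong(0), row.getString(1)))

    // Same approach for edges: each Row becomes an Edge(srcId, dstId, attr).
    val edges: RDD[Edge[String]] =
      edgeDF
        .select(col("sourceID").cast("long"), col("destID").cast("long"), col("relationship"))
        .rdd
        .map(row => Edge(row.getLong(0), row.getLong(1), row.getString(2)))

    // Build the property graph from the two RDDs.
    val graph: Graph[String, String] = Graph(vertices, edges)
    println(s"vertices=${graph.vertices.count()}, edges=${graph.edges.count()}")

    spark.stop()
  }
}
```

If the IDs are not numeric (for example, string keys), the cast would produce nulls and getLong would fail, so some separate mapping from keys to Long IDs would be needed first.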