I have a DataFrame in the following format:
character | title |
---|---|
Tony Stark | ["Iron Man"] |
James Buchanan Barnes | ["Captain America: The First Avenger","Captain America: The Winter Soldier","Captain America: Civil War","Avengers: Infinity War"] |
Marcus Bledsoe | ["Captain America: The Winter Soldier"] |
My goal is to create a GraphX representation where my vertices are Character and Title, with the edges representing when a character has appeared in a movie. This is a sample data set; the real data will be much larger, so the solution must scale across multiple executors.
I'm new to Scala and Spark. My strategy has been to create a characterVerticesRDD, a movieVerticesRDD, and then combine them together.
I believe this is a correct way to build the characterVerticesRDD:
val characterVerticesRDD: RDD[(VertexId, String)] = df.rdd.map(row => (MurmurHash3.stringHash(row.getString(0)), row.getString(0)))
The following is my first naive attempt. I realize now that using a mutable Set is invalid, since it can't be shared across executors, and that using collect won't work in a scalable solution either, because it pulls every row back to the driver.
val movieVertices = scala.collection.mutable.Set[(Long, String)]()
df.rdd.collect.foreach(row => {
  row.getAs[EmbeddedList]("title").elements.map { case d: String => d }.toList
    .foreach(movie => movieVertices += ((MurmurHash3.stringHash(movie), movie)))
})
val movieVerticesRDD: RDD[(VertexId, String)] = sc.parallelize(movieVertices.toList)
// combine vertices
val verticesRDD: RDD[(VertexId, String)] = characterVerticesRDD ++ movieVerticesRDD
What is the best way to build this movieVerticesRDD given my DataFrame structure? I somehow need to iterate through the movie titles to create the vertices. I assume the strategy would be similar when creating edges, since I'll need to iterate through each row of the data frame to create the edge between a character and its movie(s).
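To make the question concrete, here is the direction I suspect is right: a minimal, untested sketch that replaces collect and the mutable Set with a flatMap over the rows, so everything stays distributed. It assumes the title column can be read as a Seq[String] (with my EmbeddedList type, the elements would need converting first, as in my attempt above). The names MovieGraphSketch, vertexId, buildGraph, and the "appeared_in" edge label are placeholders I made up for illustration.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import scala.util.hashing.MurmurHash3

object MovieGraphSketch {
  // Same hashing scheme as in my characterVerticesRDD, widened to Long.
  // Caveat: this is a 32-bit hash, so distinct names can collide on large data.
  def vertexId(name: String): VertexId = MurmurHash3.stringHash(name).toLong

  def buildGraph(df: DataFrame): Graph[String, String] = {
    // One (character, movie) pair per list element, computed on the executors.
    // Assumption: "title" is readable as Seq[String].
    val pairs: RDD[(String, String)] = df.rdd.flatMap { row =>
      val character = row.getAs[String]("character")
      row.getAs[Seq[String]]("title").map(movie => (character, movie))
    }.cache() // reused below for both vertices and edges

    // Character vertices come from the rows themselves, so a character
    // with an empty title list still gets a vertex.
    val characterVertices: RDD[(VertexId, String)] =
      df.rdd.map(_.getAs[String]("character")).distinct().map(c => (vertexId(c), c))

    // distinct() removes movies shared by several characters.
    val movieVertices: RDD[(VertexId, String)] =
      pairs.map(_._2).distinct().map(m => (vertexId(m), m))

    // One edge per appearance; the label is a hypothetical placeholder.
    val edges: RDD[Edge[String]] =
      pairs.map { case (c, m) => Edge(vertexId(c), vertexId(m), "appeared_in") }

    Graph(characterVertices ++ movieVertices, edges)
  }
}
```

If this is roughly the right idea, I would call it as MovieGraphSketch.buildGraph(df) and read graph.vertices and graph.edges from the result. I'm unsure whether hashing names to build vertex IDs is safe at scale, or whether something like zipWithUniqueId would be the better choice.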
Thanks for any guidance.