
Is it possible to combine GraphX and DataFrames? I want a separate DataFrame for every node in the graph. I know that GraphX and DataFrames both build on RDDs, that nested RDDs are not possible, and that SparkContext is not Serializable. But in Spark 2.0.0 I saw that SparkSession is Serializable. I've tried it, but it still doesn't work. I've also tried to store the DataFrames globally in an Array, but I can't access the Array from a worker node. Ignore the methods sendMsg and merge:

import java.util

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{BooleanType, StructField, StructType}
import org.apache.spark.sql.{Dataset, Row, SparkSession}

object Main {
  def main(args: Array[String]) : Unit = {    
    val spark = SparkSession
      .builder
      .appName("ScalaGraphX_SQL")
      .master("spark://home:7077")
      .enableHiveSupport()
      .getOrCreate()

    val sc = spark.sparkContext

    val node_pair: RDD[(Array[String], Long)] = sc.textFile(args(0)).map(l => l.split(" ")).zipWithIndex()

    // allocate one slot per node; this happens on the driver
    Tables.tables = new Array[Dataset[Row]](node_pair.count().toInt)

    // insert an (empty) DataFrame for every node; this also runs on the driver only
    node_pair.collect().foreach { case (arr, l) =>
      val fields = arr.takeRight(arr.length - 2).map(fieldName => StructField(fieldName, BooleanType, nullable = true))
      val schema = StructType(fields)
      val rows = new util.ArrayList[Row]
      Tables.tables(l.toInt) = spark.createDataFrame(rows, schema)
    }

    // create vertices
    val vertices: RDD[(VertexId, TreeNode)] = node_pair.map { case (arr, l) =>
      (l, new TreeNode(l, false))
    }

    // create edges
    val edges: RDD[Edge[Boolean]] = node_pair
      .filter { case (arr, l) => arr(0).toLong != -1 }
      .map { case (arr, l) => Edge(l, arr(0).toLong, true) }

    val init_node: TreeNode = new TreeNode(-1, false)
    val graph = Graph(vertices, edges, init_node)
    val graph_pregel = Pregel(graph, init_node, Int.MaxValue, EdgeDirection.Out)(vProg, sendMsg, merge)

    graph_pregel.vertices.collect().foreach(v => println(v._2.index))
  }

  def vProg(id: VertexId, act: TreeNode, other: TreeNode): TreeNode = {
    // runs on the executors, where Tables.tables was never initialized
    println(Tables.tables(act.index.toInt))
    act
  }

  def sendMsg(et: EdgeTriplet[TreeNode, Boolean]): Iterator[(VertexId, TreeNode)] = {
    if (et.srcAttr.v) {
      println(et.srcId + "--->" + et.dstId)
      Iterator((et.dstId, et.srcAttr))
    } else {
      //println(et.srcId + "-/->" + et.dstId)
      Iterator.empty
    }
  }

  def merge(n1:TreeNode, n2:TreeNode): TreeNode = {
    n1
  }
}

// Driver-side singleton: each executor JVM gets its own, uninitialized copy.
object Tables extends Serializable {
  var tables: scala.Array[Dataset[Row]] = null
}

class TreeNode(val index: Long, var v: Boolean) extends Serializable

Is there perhaps a way to access the global array from within RDD operations? Or does someone have another solution to this problem?

Vitali D.
  • The problem is not, and never has been, serialization. "Not serializable" is just a hint here that points to the main issue: the Spark architecture is not suitable for nested processing without significantly limiting the programming model. So just because you can serialize `SparkSession` (you could serialize `SQLContext` in 1.x the same way) doesn't mean anything has changed. – zero323 Sep 19 '16 at 11:52
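
To see the second half of the question's problem in isolation, here is a minimal, self-contained sketch (all names are hypothetical) of why a driver-side singleton like the question's `Tables` object appears empty on the executors: every executor JVM runs the object's initializer again, so an assignment made on the driver is never visible there.

import org.apache.spark.sql.SparkSession

object SingletonDemo {
  // Driver-side singleton: every executor JVM re-runs this initializer.
  var value: String = null

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SingletonDemo").getOrCreate()
    val sc = spark.sparkContext

    value = "set on the driver"

    // On a real cluster (e.g. master("spark://home:7077") as in the question)
    // this prints "null": the closure reads the executor's own copy of the
    // object. Running with master("local[*]") masks the problem, because the
    // driver and the executors then share a single JVM.
    sc.parallelize(1 to 2).foreach(_ => println(SingletonDemo.value))

    spark.stop()
  }
}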

1 Answer


Please take a look at GraphFrames - it's a package that provides a DataFrame API for GraphX. GraphFrames will be considered for inclusion into Spark once it provides functionality that is important in GraphX, such as partitioning, and once the API has been tested more exhaustively.

For the problem described in the comment below, you've got one DataFrame with the nodes, e.g. airports:

val airports = spark.createDataFrame(List(
    ("A1", "Wrocław"),
    ("A2", "London"),
    ("A3", "NYC")
)).toDF("id", "name")
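
For context, building a GraphFrame over these vertices only needs one more DataFrame with the edges (the routes below are made up for illustration; GraphFrames expects an "id" column on the vertices and "src"/"dst" columns on the edges, and the package has to be on the classpath, e.g. via spark-shell --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11, where the version is only an example):

import org.graphframes.GraphFrame

// Hypothetical routes between the airports above.
val routes = spark.createDataFrame(List(
    ("A1", "A2"),
    ("A2", "A3")
)).toDF("src", "dst")

val g = GraphFrame(airports, routes) // airports already has the required "id" column
g.inDegrees.show()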

ID is unique. You can create another DataFrame, e.g. detailsDF, with a structure like: ID | AirportID | other data. Then you've got a one-to-many relationship, and for one airport (so one GraphFrame vertex) you've got many entries in detailsDF. Now you can query: spark.sql("select a.name, d.id as detailID from airports a join detailsDF d on a.id = d.airportID"). You can also have many columns in the airports DataFrame if you want to store additional information there.
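
A short sketch of that pattern (the detail rows and the flag column are invented for illustration; the DataFrames have to be registered as temp views before spark.sql can see them):

// Hypothetical per-airport detail rows: many rows can share one airportID.
val detailsDF = spark.createDataFrame(List(
    ("D1", "A1", true),
    ("D2", "A1", false),
    ("D3", "A2", true)
)).toDF("id", "airportID", "flag")

// Register both DataFrames so the SQL query can reference them by name.
airports.createOrReplaceTempView("airports")
detailsDF.createOrReplaceTempView("detailsDF")

spark.sql("select a.name, d.id as detailID from airports a join detailsDF d on a.id = d.airportID").show()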

T. Gawęda
  • Thanks, but aren't GraphFrames just graphs structured as DataFrames? I need a graph with DataFrames inside the nodes. It's like a table for each node. Or have I misunderstood GraphFrames? – Vitali D. Sep 19 '16 at 11:54
  • Yes and no :) In GraphFrames there is a table (DataFrame) for the nodes. However, each node can have some ID, and then there can be another DataFrame, e.g. NodeDetails, which will have a column "baseNodeId". Then you can have many rows for one node – T. Gawęda Sep 19 '16 at 12:22
  • I'm sorry, but I don't understand. Is that not a nested DataFrame? Can you give me a short example? Thank you very much! – Vitali D. Sep 19 '16 at 14:26
  • @VitaliD. I've updated the answer - let me know if it's clear now :) – T. Gawęda Sep 19 '16 at 14:42