Sum the Distance in Apache-Spark dataframes

Question

The following code produces a DataFrame in which each edge column holds three values, as shown below.

import org.graphframes._
import org.apache.spark.sql.DataFrame

val v = sqlContext.createDataFrame(List(
  ("1", "Al"),
  ("2", "B"),
  ("3", "C"),
  ("4", "D"),
  ("5", "E")
)).toDF("id", "name")

val e = sqlContext.createDataFrame(List(
  ("1", "3", 5),
  ("1", "2", 8),
  ("2", "3", 6),
  ("2", "4", 7),
  ("2", "1", 8),
  ("3", "1", 5),
  ("3", "2", 6),
  ("4", "2", 7),
  ("4", "5", 8),
  ("5", "4", 8)
)).toDF("src", "dst", "property")

val g = GraphFrame(v, e)
val paths: DataFrame = g.bfs.fromExpr("id = '1'").toExpr("id = '5'").run()
paths.show()

val df = paths
df.select(df.columns.filter(_.startsWith("e")).map(df(_)) : _*).show

The output of the above code is:

    +-------+-------+-------+                                                       
    |     e0|     e1|     e2|
    +-------+-------+-------+
    |[1,2,8]|[2,4,7]|[4,5,8]|
    +-------+-------+-------+

In the above output, each column holds three values, which can be interpreted as follows:

e0: source 1, destination 2, distance 8

e1: source 2, destination 4, distance 7

e2: source 4, destination 5, distance 8

Basically, e0, e1, and e2 are the edges. I want to sum the third element of each column, i.e. add up the distance of each edge to get the total distance of the path. How can I achieve this?

score 5 · Accepted Answer · 2016-12-08T23:39:46.620

It can be done like this:

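import org.apache.spark.sql.functions.col

// Select the `property` field of every edge column (e0, e1, ...) and add them up.
val total = df.columns.filter(_.startsWith("e"))
  .map(c => col(s"$c.property")) // or col(c).getItem("property")
  .reduce(_ + _)

df.withColumn("total", total)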
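
To try this without a GraphFrames dependency, here is a minimal sketch that builds a one-row DataFrame shaped like the BFS result and applies the same `reduce`. It assumes Spark 2.x or later (for `SparkSession`). The `Edge` case class, the `withTotal` helper, and the object name are made up for illustration and are not part of GraphFrames.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SumEdgeProperty {
  // Hypothetical stand-in for a GraphFrames edge struct (src, dst, property).
  case class Edge(src: String, dst: String, property: Int)

  // Adds a "total" column: the sum of `property` over all columns
  // whose name starts with "e".
  def withTotal(df: DataFrame): DataFrame = {
    val total = df.columns.filter(_.startsWith("e"))
      .map(c => col(s"$c.property"))
      .reduce(_ + _)
    df.withColumn("total", total)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sum-edge-property")
      .getOrCreate()
    import spark.implicits._

    // One row mirroring the path 1 -> 2 -> 4 -> 5 from the question.
    val df = Seq((Edge("1", "2", 8), Edge("2", "4", 7), Edge("4", "5", 8)))
      .toDF("e0", "e1", "e2")

    withTotal(df).show()
    spark.stop()
  }
}
```

For the row above, `total` should come out as 8 + 7 + 8 = 23. Note that `reduce` throws on an empty collection, so this assumes at least one edge column exists.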

edited Dec 08 '16 at 23:39

answered Dec 08 '16 at 17:33

Is `.property` meant to be a general placeholder for the element of the column you are trying to access? – evan.oman Dec 08 '16 at 17:41

@evan058 The columns OP is trying to access are GraphFrames edges. They are represented as structs with three fields (`src`, `dst`, `property`), so `property` is a field of each edge column. – Dec 08 '16 at 17:44

evan.oman · Answer 2 · 2016-12-08T16:42:13.193

I would make a collection of the columns to sum and then use a foldLeft on a UDF:

scala> val df = Seq((Array(1,2,8),Array(2,4,7),Array(4,5,8))).toDF("e0", "e1", "e2")
df: org.apache.spark.sql.DataFrame = [e0: array<int>, e1: array<int>, e2: array<int>]

scala> df.show
+---------+---------+---------+
|       e0|       e1|       e2|
+---------+---------+---------+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]|
+---------+---------+---------+

scala> val colsToSum = df.columns
colsToSum: Array[String] = Array(e0, e1, e2) 

scala> val accLastUDF = udf((acc: Int, col: Seq[Int]) => acc + col.last)
accLastUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,List(IntegerType, ArrayType(IntegerType,false)))

scala> df.withColumn("dist", colsToSum.foldLeft(lit(0))((acc, colName) => accLastUDF(acc, col(colName)))).show
+---------+---------+---------+----+
|       e0|       e1|       e2|dist|
+---------+---------+---------+----+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]|  23|
+---------+---------+---------+----+
