Shortcuts for creating complicated Column structures in Spark

Question

I am porting some Graph.pregel algorithms to GraphFrame.aggregateMessages. I'm finding the GraphFrame APIs a little cumbersome.

In the Graph APIs, I can send a case class as my message type. But in the GraphFrame APIs, aggregateMessages.sendToSrc and .sendToDst work on either a SQL expression String, or on a Column. I'm finding this to be as powerful as it is a pain in the ass.

Say you have:

case class Vote(yay: Boolean, voters: Long = 1L)
case class Send(vote: Vote, from: Long)

Using GraphX and the pregel function, I can build a sendMsg that returns Iterator[(VertexId,Send)], which could be something like: Iterator((1L, Send(Vote(yay = true), from = 2L) ))

With GraphFrames I have to build a Column that serves the same purpose as Iterator[(VertexId,Send)], ideally without completely abandoning my already-defined case classes (way more complicated than the sample above).

What kind of shortcuts are there for doing that?

What I've got so far:

It was pretty easy to convert an instance of a case class into a corresponding struct. This mostly gets me there:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, struct}

def ccToStruct(cc: Product): Column = {
  // Convert each constructor argument, recursing into nested case classes
  val fields = cc.productIterator.map {
    // Matching on Some(p: Product) avoids a ClassCastException for values like Some(3)
    case Some(p: Product) if p.productArity > 0 => ccToStruct(p)
    case p: Product if p.productArity > 0 => ccToStruct(p)
    case x => lit(x)
  }.toList
  struct(fields: _*)
}

This lets me do:

ccToStruct(Send(Vote(true, 1L), 123L))
// res4: org.apache.spark.sql.Column = struct(struct(true,1),123)

I would have to patch the schema up a little bit to make it work correctly, but before I started to do that I realized this is a totally useless approach. You never really want to convert a case class value to a struct -- ccToStruct(Send(Vote(true, 1L), 123L)) creates a pretty useless message. It's the equivalent of sending a lit(Send(..)) value -- except that lit() doesn't support case classes.

What you want to do instead is to mix and match lit values with AM.dst("*") and AM.src("*") columns, but to do so corresponding to the schema of the case class. (I thought about abandoning case classes altogether, but I have a UDAF to sum my messages, and that logic was very easy to port as long as I keep using case classes.)

I believe the answer is to be able to create a structure like this:

import org.graphframes.lib.AggregateMessages
val AM = AggregateMessages

val msg = Seq[Any](Seq[Any](true, 1L), AM.src("id"))

And then to convert that to a Column using struct() and the schema of my case class.
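
For comparison, here is a sketch of what that message looks like when written out by hand for the Send and Vote case classes above: literals and AM.src("id") mixed inside nested struct() calls, with each field aliased to the matching case class field name. The name sendMsg is only illustrative, and this assumes the GraphFrames package is on the classpath.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, struct}
import org.graphframes.lib.AggregateMessages

val AM = AggregateMessages

// Hand-built equivalent of Send(Vote(yay = true, voters = 1L), from = <source vertex id>).
// Every field is aliased so the struct's field names match the case class.
val sendMsg: Column = struct(
  struct(
    lit(true) as "yay",
    lit(1L) as "voters"
  ) as "vote",
  AM.src("id") as "from" // bound to the source vertex's id column, not a literal
)
```

Writing this out for every message type is what gets tedious, because the field names have to be repeated by hand and kept in sync with the case classes.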

If nobody has a better way to do this (and probably even if someone does) I'll answer my own question with the solution later.

David Griffin · Answer 1 · 2016-04-24T05:13:44.400

Here's what I came up with.

For what I want to do, which is to create Column objects with the structure of case classes but with the ability to bind to DataFrame.columns, I decided my primary data structure should be a Seq[Any]. The Seq should match the structure of my case class -- the Seq is basically the constructor arguments of the case class. If my case class is:

case class Vote(yay: Boolean, voters: Long)

Then I could create the following Vote-like Seq:

val voteSeq = Seq[Any](true, 1L)

The reason I have to use a Seq[Any] rather than a typed Seq is that, more interestingly, I can also put Columns in it:

val boundVote = Seq[Any](true, AM.edge("voters"))

I came up with a couple of functions that can be used to convert the Seq to a Column. I create the Column with the SQL function struct(). You could do this all with SQL string expressions, too. But I decided to go with Columns instead. I wanted to make it as clean as possible, and String SQL expressions get messy.

If you do not name your columns correctly within your struct, you get structs that look like:

  vote: struct (nullable = false)
   |-- col1: boolean (nullable = false)
   |-- col2: long (nullable = false)

That's gonna suck later on when you're trying to convert that into a case class. Instead you have to use as on all your columns, so you get:

  vote: struct (nullable = false)
   |-- yay: boolean (nullable = false)
   |-- voters: long (nullable = false)

The solution is to take a StructType and use that to create the field names. As it turns out, I had already covered automatically deriving a StructType from a case class -- so that was the easy part. The first function does the hard part -- it recursively walks through both the Seq and the schema and generates Columns that ultimately get wrapped up in a final: struct(colSeq:_*)

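import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, lit, struct}
import org.apache.spark.sql.types.{ArrayType, StructType}

def seqToColumnSchema(anySeq: Seq[Any], schema: StructType): Column = {
  // Pair each value with its schema field so every Column can be aliased to the field name
  val colSeq = anySeq.zip(schema.fields).map { case (value, field) =>
    val column = value match {
      case c: Column => c  // already a Column, e.g. AM.edge("voters")
      case p: Seq[_] if p.nonEmpty =>
        field.dataType match {
          case s: StructType => seqToColumnSchema(p, s)  // nested case class
          case _: ArrayType => array(p.map(v => lit(v)): _*)  // array of literals
          case _ => lit(p)  // literal of the value itself, not of the DataType
        }
      case x => lit(x)
    }
    column as field.name
  }
  struct(colSeq: _*)
}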

This second function is just a wrapper around the first, but it lets you do:

seqToColumn[Vote](Seq(true, AM.edge("voters")))

Instead of having to provide the StructType, you only have to give the name of the case class inside the [...]

import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.catalyst.ScalaReflection

def seqToColumn[T: TypeTag](anySeq: Seq[Any]) : Column = {
  val schema = ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]
  seqToColumnSchema(anySeq, schema)  
}
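
A side note on where the schema comes from: ScalaReflection lives in Spark's internal catalyst package, so it may change between releases. One possible alternative, assuming Spark 1.6 or later, is to take the schema from the public Encoders API instead. A minimal sketch, using the Vote and Send case classes from the question (sendSchema is just an illustrative name):

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

case class Vote(yay: Boolean, voters: Long)
case class Send(vote: Vote, from: Long)

// Derive the StructType for a case class through the public Encoders API.
// Nested case classes become nested StructType fields.
val sendSchema: StructType = Encoders.product[Send].schema

// Print the derived schema as a tree to check the field names
println(sendSchema.treeString)
```

The resulting StructType could then be passed as the schema argument to a function like seqToColumnSchema.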

All of that, just so that I can do this:

import org.graphframes.lib.AggregateMessages
val AM = AggregateMessages

case class Vote(yay: Boolean, voters: Long)

val voteSeq = Seq[Any](true, AM.edge("voters"))
val voteMsg = seqToColumn[Vote](voteSeq)
// voteMsg: org.apache.spark.sql.Column = struct(true AS yay#18,edge[voters] AS voters#19)

graphFrame.aggregateMessages.sendToDst(voteMsg).agg(voteSum(AM.msg) as "out").printSchema
root
 |-- id: long (nullable = false)
 |-- out: struct (nullable = true)
 |    |-- vote: struct (nullable = false)
 |    |    |-- yay: boolean (nullable = false)
 |    |    |-- voters: long (nullable = false)

No, no bug -- you just have to specify `Seq[Any]`, because `Seq(1.0, 4L, 123)` is not the same as `Seq[Any](1.0, 4L, 123)`: only the second one doesn't squash your values into a common compatible type. -- David Griffin, Apr 23 '16 at 17:42
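
To illustrate the point in that comment, here is a small sketch in plain Scala with no Spark needed. It assumes a Scala 2 compiler, which is what Spark used at the time; the names widened, asDoubles and kept are just for illustration.

```scala
// Without a type annotation, Scala 2 widens mixed numeric literals to one common type
val widened = Seq(1.0, 4L, 123)

// This assignment only compiles because the line above was inferred as Seq[Double]
val asDoubles: Seq[Double] = widened

// With Seq[Any], each element keeps its own runtime type (Double, Long, Int)
val kept = Seq[Any](1.0, 4L, 123)

// Show the boxed runtime class of each element in the Seq[Any]
println(kept.map(_.getClass.getSimpleName))
```

This matters for the helpers above because each plain value is passed to lit(), so a Long that has been widened to a Double would produce a double literal where the case class schema expects a long.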
