I am porting some Graph.pregel algorithms to GraphFrame.aggregateMessages. I'm finding the GraphFrame APIs a little cumbersome.
In the Graph APIs, I can send a case class as my message type. But in the GraphFrame APIs, aggregateMessages.sendToSrc and .sendToDst work on either a SQL expression String or on a Column. I'm finding this to be as powerful as it is a pain in the ass.
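For reference, here is a minimal sketch of the two message forms mentioned above, one passed as a Column and one as a SQL expression String. The helper name sumNeighbourIds and the assumption that the vertex id column is numeric are mine, purely for illustration:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum
import org.graphframes.GraphFrame
import org.graphframes.lib.AggregateMessages

// Hypothetical helper: every vertex sums the ids of its neighbours.
// Assumes the vertex "id" column is numeric.
def sumNeighbourIds(g: GraphFrame): DataFrame = {
  val AM = AggregateMessages
  g.aggregateMessages
    .sendToSrc(AM.dst("id")) // message given as a Column
    .sendToDst("src.id")     // message given as a SQL expression String
    .agg(sum(AM.msg).as("idSum"))
}
```

Either way the message is an expression over the src, dst and edge columns, not a Scala value, which is where the friction with case classes comes from.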
Say you have:
case class Vote(yay: Boolean, voters: Long = 1L)
case class Send(vote: Vote, from: Long)
Using GraphX and the pregel function, I can build a sendMsg that returns Iterator[(VertexId, Send)], which could be something like: Iterator((1L, Send(Vote(yay = true), from = 2L)))
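To make the GraphX side concrete, a sendMsg along these lines might look like the sketch below. The choice of Vote as the vertex attribute type and Int as the edge attribute type is an assumption for illustration only:

```scala
import org.apache.spark.graphx.{EdgeTriplet, VertexId}

case class Vote(yay: Boolean, voters: Long = 1L)
case class Send(vote: Vote, from: Long)

// Assumed graph shape: vertices carry a Vote, edges carry an Int.
// Each source vertex forwards its own vote to the destination vertex.
def sendMsg(t: EdgeTriplet[Vote, Int]): Iterator[(VertexId, Send)] =
  Iterator((t.dstId, Send(t.srcAttr, from = t.srcId)))
```

The message is an ordinary case class instance built from the triplet's fields, so no schema bookkeeping is needed.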
With GraphFrames I have to build a Column that serves the same purpose as Iterator[(VertexId, Send)], ideally without completely abandoning my already-defined case classes (which are way more complicated than the sample above).
What kind of shortcuts are there for doing that?
What I've got so far:
It was pretty easy to convert an instance of a case class into a corresponding struct. This mostly gets me there:
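import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, struct}

// Recursively convert a case class instance into a struct Column of literals.
def ccToStruct(cc: Product): Column = {
  val fields = cc.productIterator.map {
    // An Option wrapping a nested case class: unwrap it and recurse.
    // (Matching on Some[Product @unchecked] would accept any Some because of
    // erasure and then fail on p.get for non-Product contents.)
    case Some(p: Product) if p.productArity > 0 => ccToStruct(p)
    // A nested case class: recurse.
    case p: Product if p.productArity > 0 => ccToStruct(p)
    // Anything else is treated as a plain literal.
    case x => lit(x)
  }.toList
  struct(fields: _*)
}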
This lets me do:
ccToStruct(Send(Vote(true, 1L), 123L))
// res4: org.apache.spark.sql.Column = struct(struct(true,1),123)
I would have to patch the schema up a little bit to make it work correctly, but before I started to do that I realized this is a totally useless approach. You never really want to convert a case class value to a struct: ccToStruct(Send(Vote(true, 1L), 123L)) creates a pretty useless message, because every field is a constant and so the message is identical for every edge. It's the equivalent of sending a lit(Send(..)) value, except that lit() doesn't support case classes.
What you want to do instead is to mix and match lit values with AM.dst("*") and AM.src("*") columns, but to do so corresponding to the schema of the case class. (I thought about abandoning case classes altogether, but I have a UDAF to sum my messages, and that logic was very easy to port as long as I keep using case classes.)
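Written out by hand for the sample Send and Vote classes, such a mixed message might look like the following sketch. It assumes the vertex id is a Long and simply aliases each field to the matching case class field name:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, struct}
import org.graphframes.lib.AggregateMessages

val AM = AggregateMessages

// Mirrors Send(vote = Vote(yay, voters), from): literals for the vote,
// the sending vertex's id (assumed to be a Long) for `from`.
val msg: Column = struct(
  struct(lit(true).as("yay"), lit(1L).as("voters")).as("vote"),
  AM.src("id").as("from")
)
```

This works, but it repeats the field names and nesting of the case classes by hand, which is exactly the part that gets tedious for larger message types.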
I believe the answer is to be able to create a structure like this:
import org.graphframes.lib.AggregateMessages
val AM = AggregateMessages
val msg = Seq[Any](Seq[Any](true, 1L), AM.src("id"))
And then to convert that to a Column using struct() and the schema of my case class.
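One possible way to do that conversion is sketched below. It is only an illustration, not a finished solution: the helper name toStruct is hypothetical, the schema is taken from Encoders.product, and nested Seq values are assumed to line up positionally with nested case class fields.

```scala
import org.apache.spark.sql.{Column, Encoders}
import org.apache.spark.sql.functions.{lit, struct}
import org.apache.spark.sql.types.StructType
import org.graphframes.lib.AggregateMessages

case class Vote(yay: Boolean, voters: Long = 1L)
case class Send(vote: Vote, from: Long)

// Hypothetical helper: pair each value with the schema field at the same
// position, recursing into nested Seqs for nested structs.
def toStruct(values: Seq[Any], schema: StructType): Column = {
  require(values.length == schema.fields.length,
    s"expected ${schema.fields.length} values, got ${values.length}")
  val cols = values.zip(schema.fields).map {
    case (nested: Seq[_], field) =>
      field.dataType match {
        case st: StructType => toStruct(nested, st).as(field.name)
        case other => sys.error(s"field ${field.name} is $other, not a struct")
      }
    // Existing Columns (e.g. AM.src("id")) are cast and renamed to fit.
    case (c: Column, field) => c.cast(field.dataType).as(field.name)
    // Everything else becomes a literal of the field's type.
    case (v, field) => lit(v).cast(field.dataType).as(field.name)
  }
  struct(cols: _*)
}

val AM = AggregateMessages
val sendSchema: StructType = Encoders.product[Send].schema
val msg: Column = toStruct(Seq[Any](Seq[Any](true, 1L), AM.src("id")), sendSchema)
```

Because the field names and types come from the case class schema, the resulting struct should have the same shape as Send, which is what a UDAF written against the case classes would expect. The case classes need to be defined at the top level (or in the REPL) so that Encoders.product can find a TypeTag for them.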
If nobody has a better way to do this (and probably even if someone does) I'll answer my own question with the solution later.