I'm on Spark 2.1.0 with Scala 2.11. I have a requirement to store state in Map[String, Any] format for every key. The right candidate to solve my problem appears to be mapWithState(), which is defined in PairDStreamFunctions. The DStream on which I am applying mapWithState() is of type DStream[Row]. Before applying mapWithState(), I do this:
// key each Row by its first column; note that row.get(0) is typed Any
dstream.map(row => (row.get(0), row))
Now my DStream is of type DStream[(Any, Row)]. On this DStream I apply mapWithState(), and here's what my update function looks like:
def stateUpdateFunction(): (Any, Option[Row], State[Map[String, Any]]) => Option[Row] = {
  (key, newData, stateData) => {
    if (stateData.exists()) {
      // Remember the previous state before overwriting it
      val oldState = stateData.get()
      stateData.update(Map("count" -> newData.get.get(1), "sum" -> newData.get.get(2)))
      // Append the previously stored "count" and "sum" to the incoming Row
      Some(Row.fromSeq(newData.get.toSeq ++ Seq(oldState("count"), oldState("sum"))))
    } else {
      // First batch for this key: initialize the state and pad the output with nulls
      stateData.update(Map("count" -> newData.get.get(1), "sum" -> newData.get.get(2)))
      Some(Row.fromSeq(newData.get.toSeq ++ Seq[Any](null, null)))
    }
  }
}
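For context, here is roughly how I wire this function into the stream (a minimal sketch; ssc is assumed to be my StreamingContext, and the checkpoint path is a placeholder):

import org.apache.spark.sql.Row
import org.apache.spark.streaming.StateSpec

// mapWithState() requires checkpointing to be enabled
ssc.checkpoint("/tmp/checkpoint")

// StateSpec.function wraps (key, Option[value], State[state]) => mapped
val spec = StateSpec.function(stateUpdateFunction())
val stateStream = dstream
  .map(row => (row.get(0), row))
  .mapWithState(spec)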
Right now the update function only stores two values per key in the Map, appends the old values stored against "count" and "sum" to the input Row, and returns the result; the state Map is then updated with the new values from the input Row. My requirement is to be able to perform complex operations on the input Row, the way we do on a DataFrame, before storing the results in the state Map. In other words, I would like to be able to do something like this:
val transformedRow = originalRow.select(concat(upper($"C0"), lit("dummy")), lower($"C1"), ...)
Inside the update function I don't have access to a SparkContext or SparkSession, so I cannot create a single-row DataFrame; if I could, applying DataFrame operations would not be difficult. I already have all the column expressions defined for the transformed row.
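To make that concrete, the expressions I have in hand look roughly like this (a sketch; the column names C0 and C1 are just the placeholders from the example above, and the expressions are built on the driver where a SparkSession exists):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, concat, lit, lower, upper}

// Column expressions defined up front on the driver; the open question is
// how to evaluate them against each incoming Row inside the update function
val transformedColumns: Seq[Column] = Seq(
  concat(upper(col("C0")), lit("dummy")),
  lower(col("C1"))
)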
Here is my sequence of operations: read the state -> perform complex DataFrame operations on the input row using this state -> perform more complex DataFrame operations to compute the new values for the state.
Is it possible to fetch the SparkPlan/LogicalPlan corresponding to a DataFrame query/operation and apply it to a single spark-sql Row? I would very much appreciate any leads here. Please let me know if the question is unclear or if more details are required.
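To illustrate what I mean, the plans themselves are reachable; what I don't know is whether they can be replayed against a lone Row (a sketch, assuming a SparkSession named spark on the driver):

// A throwaway DataFrame carrying the desired transformation
val df = spark.range(1).selectExpr("concat(upper('a'), 'dummy') AS c0")

// Both plans are exposed through queryExecution
val logicalPlan  = df.queryExecution.analyzed  // LogicalPlan
val physicalPlan = df.queryExecution.sparkPlan // SparkPlan

// The open question: is there a supported way to evaluate one of these
// plans against a single Row inside the mapWithState update function?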