
I'm on Spark 2.1.0 with Scala 2.11. I have a requirement to store state in Map[String, Any] format for every key. The right candidate for my problem appears to be mapWithState(), which is defined in PairDStreamFunctions. The DStream on which I am applying mapWithState() is of type DStream[Row]. Before applying mapWithState(), I do this:

dstream.map(row => (row.get(0), row))

Now my DStream is of type DStream[(Any, Row)]. On this DStream I apply mapWithState, and here's how my update function looks:

import org.apache.spark.sql.Row
import org.apache.spark.streaming.State

def stateUpdateFunction(): (Any, Option[Row], State[Map[String, Any]]) => Option[Row] = {
  (key, newData, stateData) => {
    if (stateData.exists()) {
      // Remember the previous state before overwriting it
      val oldState = stateData.get()
      stateData.update(Map("count" -> newData.get.get(1), "sum" -> newData.get.get(2)))
      // Append the previously stored "count" and "sum" to the incoming Row
      Some(Row.fromSeq(newData.get.toSeq ++ Seq(oldState("count"), oldState("sum"))))
    } else {
      // First time we see this key: store the new values and pad with nulls
      stateData.update(Map("count" -> newData.get.get(1), "sum" -> newData.get.get(2)))
      Some(Row.fromSeq(newData.get.toSeq ++ Seq[Any](null, null)))
    }
  }
}
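
For context, here's a minimal sketch of how this update function would be plugged in via StateSpec; the name keyedDStream for the DStream[(Any, Row)] built above is an assumption:

import org.apache.spark.streaming.StateSpec

// keyedDStream: DStream[(Any, Row)] produced by the map() shown earlier
val stateSpec = StateSpec.function(stateUpdateFunction())
val stateDStream = keyedDStream.mapWithState(stateSpec)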

Right now, the update function only stores two values per key in the Map, appends the old values stored against "count" and "sum" to the input Row, and returns the result. The state Map then gets updated with the new values from the input Row. My requirement is to be able to perform complex operations on the input Row, just as we do on a DataFrame, before storing the results in the state Map. In other words, I would like to be able to do something like this:

var transformedRow = originalRow.select(concat(upper($"C0"), lit("dummy")), lower($"C1") ...)

In the update function I don't have access to a SparkContext or SparkSession, so I cannot create a single-row DataFrame. If I could do that, applying DataFrame operations would not be difficult. I already have all the column expressions defined for the transformed row.

Here's my sequence of operations: read state -> perform complex DataFrame operations on the input row using this state -> perform more complex DataFrame operations to define new values for the state.

Is it possible to fetch the SparkPlan/logical plan corresponding to a DataFrame query/operation and apply it to a single spark-sql Row? I would very much appreciate any leads here. Please let me know if the question is not clear or more details are required.

Kryptic Coder
  • Why not transform the rows *before* storing them in the state, and use `mapWithState` to store the updated row between iterations? – Yuval Itzchakov Apr 26 '17 at 07:07
  • thanks for your response. I have a similar requirement for updating the state as well (performing DataFrame operations). I have updated the question with the desired flow. – Kryptic Coder Apr 26 '17 at 08:23
  • I tried to create a single-row DataFrame in stateUpdateFunction and tried applying the operations on it. I'm sure it's not the right way to do this, because it involves creating a singleton SparkContext and SQLContext on the executor. This works in local mode, but I run into issues in cluster mode. Anyway, the question still remains: when I can define transform operations on a DataFrame, why can I not do it on an individual row? – Kryptic Coder May 05 '17 at 10:07
  • My row in state-update function contains both columns (in the Row) from parent dstream and the previous state, to calculate total count for example. – Kryptic Coder May 05 '17 at 10:10
  • @KrypticCoder, Is it mandatory to use `mapWithState`? `mapWithState` should be used when you have to do same operation on sequence/collection of RDDs which you can not do with other inbuilt functions. – Ramesh Maharjan May 07 '17 at 02:10

1 Answer


I've found a not-so-efficient solution to the given problem. Since the DataFrame operations and the schema are known in advance, we can create an empty DataFrame with that schema and apply the operations to it. This DataFrame gives us the SparkPlan through

DataFrame.queryExecution.sparkPlan
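
For illustration, here's a sketch of how that plan could be produced on the driver; the SparkSession (spark), the schema, and the column expressions are assumptions standing in for the real ones:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{concat, lit, lower, upper}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("plan-extraction").getOrCreate()
import spark.implicits._

// Empty DataFrame with the (known) schema of the streaming Rows
val schema = StructType(Seq(StructField("C0", StringType), StructField("C1", StringType)))
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// Apply the desired transformation; only its physical plan is of interest
val plannedDf = emptyDf.select(concat(upper($"C0"), lit("dummy")), lower($"C1"))
val sparkPlan = plannedDf.queryExecution.sparkPlan  // to be passed to stateUpdateFunction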

This object is serializable and can be passed to stateUpdateFunction. Inside stateUpdateFunction, we can iterate over the expressions contained in the passed SparkPlan, transforming each one to replace attribute references with the corresponding literals:

import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Literal}

sparkPlan.expressions.map(expr => {
  expr.transform {
    case attr: AttributeReference =>
      // Substitute each column reference with the value from the current Row
      println(s"Resolving ${attr.name} to ${map.getOrElse(attr.name, null)}, " +
        s"type: ${map.getOrElse(attr.name, null).getClass.toString}")
      Literal(map.getOrElse(attr.name, null))
    case a => a
  }
})

The map here refers to the Row's column-value pairs. On these transformed expressions we call eval, passing an empty InternalRow; this gives us the result of every expression. Because this approach relies on interpreted evaluation and doesn't employ code generation, it will be inefficient in a real-world use case, but I'll dig further to find out how code generation can be leveraged here.
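
Putting it together inside stateUpdateFunction, a rough sketch of the evaluation step might look like this (map is assumed to have been built from the incoming Row's column names and values):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Literal}

// map: column name -> value, built from the incoming Row
val results = sparkPlan.expressions.map { expr =>
  val bound = expr.transform {
    case attr: AttributeReference => Literal(map.getOrElse(attr.name, null))
  }
  // Interpreted evaluation; no input row is needed since all attributes are now literals
  bound.eval(InternalRow.empty)
}

Note that eval returns values in Catalyst's internal representation (for example, UTF8String for strings), so they may need converting back before building a new Row from them.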

Kryptic Coder