Spark Dataframe access of previous calculated row

Question

I have following Data:

+-----+-----+----+
|Col1 |t0   |t1  |
+-----+-----+----+
| A   |null |20  |
| A   |20   |40  |
| B   |null |10  |
| B   |10   |20  |
| B   |20   |120 |
| B   |120  |140 |
| B   |140  |320 |
| B   |320  |340 |
| B   |340  |360 |
+-----+-----+----+

And what I want is something like this:

+-----+-----+----+----+
|Col1 |t0   |t1  |grp |
+-----+-----+----+----+
| A   |null |20  |1A  |
| A   |20   |40  |1A  |
| B   |null |10  |1B  |
| B   |10   |20  |1B  |
| B   |20   |120 |2B  |
| B   |120  |140 |2B  |
| B   |140  |320 |3B  |
| B   |320  |340 |3B  |
| B   |340  |360 |3B  |
+-----+-----+----+----+

Explanation: The extra column is based on the Col1 and the difference between t1 and t0. When the difference between that two is too high => a new number is generated. (in the dataset above when the difference is greater than 50)

I build t0 with:

val windowSpec = Window.partitionBy($"Col1").orderBy("t1")
df = df.withColumn("t0", lag("t1", 1) over windowSpec)

Can someone help me how to do it? I searched but didn't get a good idea. I'm a little bit lost because I need the value of the previous calculated row of grp...

Thanks

score 3 · Accepted Answer · edited May 23 '17 at 12:18

I solved it myself

val grp =  (coalesce(
      ($"t" - lag($"t", 1).over(windowSpec)),
      lit(0)
    ) > 50).cast("bigint")

df = df.withColumn("grp", sum(grp).over(windowSpec))

With this I don't need both colums (t0 and t1) anymore but can use only t1 (or t) without compute t0.

(I only need to add the value of Col1 but the most important part the number is done and works fine.)

I got the solution from: Spark SQL window function with complex condition

thanks for your help

Ramesh Maharjan · Answer 2 · 2017-05-18T08:09:45.463

You can use udf function to generate the grp column

def testUdf = udf((col1: String, t0: Int, t1: Int)=> (t1-t0) match {
  case x : Int if(x > 50) => 2+col1
  case _ => 1+col1
})

Call the udf function as

df.withColumn("grp", testUdf($"Col1", $"t0", $"t1"))

The udf function above won't work properly due to null values in t0 which can be replaced by 0

df.na.fill(0)

I hope this is the answer you are searching for.

Edited
Here's the complete solution using udaf . The process is complex . You've already got easy answer but it might help somebody who might use it
First defining udaf

class Boendal extends UserDefinedAggregateFunction {

  def inputSchema = new StructType().add("Col1", StringType).add("t0", IntegerType).add("t1", IntegerType).add("rank", IntegerType)
  def bufferSchema = new StructType().add("buff", StringType).add("buffer1", IntegerType)
  def dataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, "")
    buffer.update(1, 0)
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0)) {
      val buff = buffer.getString(0)
      val col1 = input.getString(0)
      val t0 = input.getInt(1)
      val t1 = input.getInt(2)
      val rank = input.getInt(3)
      var value = 1
      if((t1-t0) < 50)
        value = 1
      else
        value = (t1-t0)/50

      val lastValue = buffer(1).asInstanceOf[Integer]
      // if(!buff.isEmpty) {
      if (value < lastValue)
        value = lastValue
      // }
      buffer.update(1, value)

      var finalString = ""
      if(buff.isEmpty){
        finalString = rank+";"+value+col1
      }
      else
        finalString = buff+"::"+rank+";"+value+col1

      buffer.update(0, finalString)
    }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    val buff1 = buffer1.getString(0)
    val buff2 = buffer2.getString(0)
    buffer1.update(0, buff1+buff2)
  }

  def evaluate(buffer: Row) : String = {
    buffer.getString(0)
  }
}

Then some udfs

def rankUdf = udf((grp: String)=> grp.split(";")(0))
def removeRankUdf = udf((grp: String) => grp.split(";")(1))

And finally call the udaf and udfs

val windowSpec = Window.partitionBy($"Col1").orderBy($"t1")
df = df.withColumn("t0", lag("t1", 1) over windowSpec)
  .withColumn("rank", rank() over windowSpec)
df = df.na.fill(0)
val boendal = new Boendal
val df2 = df.groupBy("Col1").agg(boendal($"Col1", $"t0", $"t1", $"rank").as("grp2")).withColumnRenamed("Col1", "Col2")
    .withColumn("grp2", explode(split($"grp2", "::")))
    .withColumn("rank2", rankUdf($"grp2"))
    .withColumn("grp2", removeRankUdf($"grp2"))

df = df.join(df2, df("Col1") === df2("Col2") && df("rank") === df2("rank2"))
  .drop("Col2", "rank", "rank2")
df.show(false)

Hope it helps

I think this could work when it's always only 2 groups. But it is always unknown how many groups I have. It could be only one or also hundred. — Boendal, May 17 '17 at 10:17
your fourth column depends on first, second and third column only, it doesn't depend on grouping of col1. Isn't it? your grouping is done in Window function you mentioned when you generated t0 column. — Ramesh Maharjan, May 17 '17 at 10:52
I updated the example. Now your solution won't work anymore. That's what I mean. — Boendal, May 17 '17 at 11:23
Can you also explain how you are generating 1A, 2A , 1B, 3B ... so that I can help you achieve that. — Ramesh Maharjan, May 17 '17 at 11:29
A & B are from Col1. The number always increase when the difference between t0 and t1 is higher than 50. that's why I have to need which number was before in the grp column so I know which number I have to increase/fill in. — Boendal, May 17 '17 at 11:33
In `| B |340 |360 |3B |`, how `t1 - t0` is greater than 50? please help me understand — Ramesh Maharjan, May 17 '17 at 11:47
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/144453/discussion-between-ramesh-maharjan-and-boendal). — Ramesh Maharjan, May 17 '17 at 11:49
okay I'm in the chat but for others: i, sure they are not greater als 50 but when they are not greater than 50 then it's the value from the row before. (forgot to mention this) — Boendal, May 17 '17 at 12:06

Spark Dataframe access of previous calculated row

2 Answers2

Linked