
I have a question concerning a window operation on Spark DataFrames in version 1.6.

Let's say I have the following table:

id | MONTH  | number
1  | 201703 | 2
1  | 201704 | 3
1  | 201705 | 7
1  | 201706 | 6

At the moment I'm using the rowsBetween function:

val window = Window.partitionBy("id")
  .orderBy(asc("MONTH"))
  .rowsBetween(-2, 0)

randomDF.withColumn("counter", sum(col("number")).over(window))

This gives me following results:

id | MONTH  | number | counter
1  | 201703 | 2      | 2
1  | 201704 | 3      | 5
1  | 201705 | 7      | 12
1  | 201706 | 6      | 16

What I want to achieve is setting a default value (like in lag() and lead(); see the small example after the table below) when there are no preceding rows. For example '0', so that I get results like:

id | MONTH  | number | counter
1  | 201703 | 2      | 0
1  | 201704 | 3      | 0
1  | 201705 | 7      | 12
1  | 201706 | 6      | 16
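
For reference, the default-value behaviour of lag()/lead() mentioned above is their optional third argument; a minimal sketch, using the question's randomDF and an illustrative column name "previous":

val w = Window.partitionBy("id").orderBy(asc("MONTH"))
// the third argument makes lag() return 0 instead of null when no preceding row exists
randomDF.withColumn("previous", lag(col("number"), 1, 0).over(w))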

I've already looked in the documentation but Spark 1.6 does not allow this, and I was wondering if there was some kind of workaround.

Many thanks!

  • Would you be interested in a workaround in terms of an additional `withColumn` operation? – astro_asz Feb 15 '18 at 14:05
  • Maybe, if that is possible. I was thinking of windowing over the preceding rows and putting them in a list using `collect_list()`, then in the next `withColumn` operation checking whether that list contains 3 values and otherwise setting the value to 0. But `collect_list()` in a window operation does not work in version 1.6. – RudyVerboven Feb 15 '18 at 14:19

2 Answers


How about something like this, where you:

  • add an additional lag step
  • substitute the values with a case expression

Code:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// `spark` here is a SparkSession (Spark 2.x syntax); on 1.6 build the
// DataFrame with sqlContext.createDataFrame instead
val rowsRdd: RDD[Row] = spark.sparkContext.parallelize(
  Seq(
    Row(1, 1, 201703, 2),
    Row(2, 1, 201704, 3),
    Row(3, 1, 201705, 7),
    Row(4, 1, 201706, 6)))

val schema: StructType = new StructType()
  .add(StructField("sortColumn", IntegerType, false))
  .add(StructField("id",         IntegerType, false))
  .add(StructField("month",      IntegerType, false))
  .add(StructField("number",     IntegerType, false))

val df0: DataFrame = spark.createDataFrame(rowsRdd, schema)

// number of preceding rows to include in the running sum
val prevRows = 2

// sliding frame: the current row plus the `prevRows` rows before it
val window = Window.partitionBy("id")
  .orderBy(col("month"))
  .rowsBetween(-prevRows, 0)

// ordered window (default frame) used only for the lag check
val window2 = Window.partitionBy("id")
  .orderBy(col("month"))

val df2 = df0.withColumn("counter", sum(col("number")).over(window))
// lag a constant 1 by `prevRows` rows: the result is null exactly when there
// are fewer than `prevRows` preceding rows, i.e. the frame is not yet full
val df3 = df2.withColumn("myLagTmp", lag(lit(1), prevRows).over(window2))
// zero out the counter for those incomplete frames, then drop the helper column
val df4 = df3.withColumn("counter",
  expr("case when myLagTmp is null then 0 else counter end"))
  .drop(col("myLagTmp"))
df4.sort("sortColumn").show()
– astro_asz

Thanks to the answer of @astro_asz, I've come up with the following solution:

val numberRowsBetween = 2
val window1 = Window.partitionBy("id").orderBy("MONTH")
val window2 = Window.partitionBy("id")
  .orderBy(asc("MONTH"))
  .rowsBetween(-(numberRowsBetween - 1), 0)

// lag(number, numberRowsBetween, 0) returns the default 0 when there is no row
// numberRowsBetween positions back, so the counter is zeroed for those rows
randomDF.withColumn("counter",
  when(lag(col("number"), numberRowsBetween, 0).over(window1) === 0, 0)
    .otherwise(sum(col("number")).over(window2)))

This solution will put '0' as the default value.
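
Note that `numberRowsBetween` drives both the lag offset and, via `numberRowsBetween - 1`, the width of the summing frame: whenever there is no row that many positions back, `lag()` falls back to its default of 0 and the `when` branch replaces the counter with 0.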

– RudyVerboven