0

I have table1

number  type   time
--------------------
1       on     1
1       on     5
1       off    10
1       off    15

I need to transform it into table2

number  type   time    
--------------------
1       on     1
1       off    2
1       on     5
1       off    10
1       on     14
1       off    15

The conditions are if $"type" === "ON" && $"lag_type" =!= "OFF" I copy lagging row, but replace type with OFF and increment time by a second. If $"type" =!= "ON" && $"lag_type" === "OFF" I copy current row, but replace type with ON and decrease time by a second. If lag type is null skip.

I found which rows should be copied

val window = Window.partitionBy($"number").orderBy($"time")
df
.withColumn("lag_type", lag($"type", 1, null).over(window))
.withColumn("lag1", $"type" === "ON" && $"lag_type" =!= "OFF")
.withColumn("lag2", $"type" =!= "ON" && $"lag_type" === "OFF")

But don't know how to add rows. Especially if they are based on the values from the lagging one. For the current one, I could probably make lists

type       time
(on, off)  (14, 15)

and explode them like it's show here, but again I'm lost when it comes to the lagging row.

Any advice? I'm using spark 2.2. Is it possible with case classes?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
gjin
  • 860
  • 1
  • 14
  • 28

1 Answers1

0

If I understand your question correctly, you want to add rows ONLY FOR the cases where you find lag1 or lag2 is true .

One way of doing it :

//Filter only the rows that needs to be changed : 

val df2 = df.withColumn("lag_type", lag($"type", 1, null).over(window))
.withColumn("lag1", $"type" === "ON" && $"lag_type" =!= "OFF")
.withColumn("lag2", $"type" =!= "ON" && $"lag_type" === "OFF")
.filter( $"lag1"|| $"lag2")

 
//Change the rows based on the values of lag1 and lag2 
//then drop extra columns

val newChangedDf = df2
.withColumn("time",when($"lag1", $"time"+1).otherwise($"time"-1))
.withColumn("type",when($"lag1", lit("OFF")).otherwise(lit("ON")))
.drop("lag_type","lag1","lag2")

//Finally add them to the original df. 

val finalDf = df.union(newChangedDf)

Note : This doesn't handle the case when lag1 and lag2 both are true . Please manipulate the above code as per your requirement.

Sanket9394
  • 2,031
  • 1
  • 10
  • 15