Filtering a spark DataFrame down to updates only

Question

Suppose I have a DataFrame in Spark consisting of columns for id, date, and a number of properties (x, y, z, say). Unfortunately the DataFrame is very large. Fortunately, most of the records are "no-change" records, in which the id, x, y, and z are the same, and only the date changes. For example

| date     | id | x |
| -------- | -- | - |
| 20150101 | 1  | 1 |
| 20150102 | 1  | 1 |
| 20150103 | 1  | 1 |
| 20150104 | 1  | 1 |
| 20150105 | 1  | 2 |
| 20150106 | 1  | 2 |
| 20150107 | 1  | 2 |
| 20150108 | 1  | 2 |

could be reduced to

| date     | id | x |
| -------- | -- | - |
| 20150101 | 1  | 1 |
| 20150105 | 1  | 2 |

I originally thought that this function would do what I wanted

def filterToUpdates (df : DataFrame) = {
  val colsData = df.column.filter(x => (x != "id" && x != "date"))
  val window = Window.partitionBy(colsData).orderBy($"date".asc)
  df.withColumn("row_num", row_number.over(window)).
    select($"row_num" === 1).drop("row_num")

But that fails in the case when I have my data columns change, and then change back.

e.g.

| date     | id | x |
| -------- | -- | - |
| 20150101 | 1  | 1 |
| 20150102 | 1  | 1 |
| 20150103 | 1  | 1 |
| 20150104 | 1  | 1 |
| 20150105 | 1  | 2 |
| 20150106 | 1  | 2 |
| 20150107 | 1  | 1 |
| 20150108 | 1  | 1 |

would be transformed to

| date     | id | x |
| -------- | -- | - |
| 20150101 | 1  | 1 |
| 20150105 | 1  | 2 |

instead of what I want:

| date     | id | x |
| -------- | -- | - |
| 20150101 | 1  | 1 |
| 20150105 | 1  | 2 |
| 20150107 | 1  | 1 |

This wouldn't be a difficult task in procedural code that got handed the records for row in order (partitioned by id and ordered by date), but I just can't see how to phrase it as a spark computation.

Note: this differs from Spark SQL window function with complex condition. I'm looking to filter out rows that differ from previous rows, which is an additional thing that could be done after the construction of the became_inactive column in that question.

Possible duplicate of [Spark SQL window function with complex condition](https://stackoverflow.com/questions/42448564/spark-sql-window-function-with-complex-condition) — Ramesh Maharjan, Jun 12 '18 at 01:20

score 2 · Accepted Answer · answered Jun 11 '18 at 19:53

2

You can easily use lag. Window

val window = Window.partitionBy($"id").orderBy($"date".asc)

Column

import org.apache.spark.sql.functions.{coalesce, lag, lit}

val keep = coalesce(lag($"x", 1).over(window) =!= $"x", lit(true))

df.withColumn("keep", keep).where($"keep").drop("keep").show

// +--------+---+---+
// |    date| id|  x|
// +--------+---+---+
// |20150101|  1|  1|
// |20150105|  1|  2|
// |20150107|  1|  1|
// +--------+---+---+

answered Jun 11 '18 at 19:53

Alper t. Turker

34,230
9
83
115

This works for the example I showed where I have a single 'data' variable `x` that I want to watch for changes in. But what about the case where I have multiple columns (`x1`, `x2`, ..., `xn`, say), and I only know what those columns are called (and how many there are) from a `Seq[String]`. The problem in that case is that `lag` is not variadic. Should I be making a `Seq[Column]` of different versions of `"keep"`, and then do fold the thread `df` through the sequnce of columns and filter for them? – Harry Braviner Jun 14 '18 at 17:46
You'll need something like `val keep = cols.map(x => coalesce(lag(x, 1).over(window) =!= x, lit(true))).reduce(_ | _)` – Alper t. Turker Jun 14 '18 at 17:50

Filtering a spark DataFrame down to updates only

1 Answers1