Suppose I have a DataFrame
in Spark consisting of columns for id, date, and a number of properties (x, y, z, say). Unfortunately, the DataFrame is very large. Fortunately, most of the records are "no-change" records, in which the id, x, y, and z are the same and only the date changes. For example:
| date | id | x |
| -------- | -- | - |
| 20150101 | 1 | 1 |
| 20150102 | 1 | 1 |
| 20150103 | 1 | 1 |
| 20150104 | 1 | 1 |
| 20150105 | 1 | 2 |
| 20150106 | 1 | 2 |
| 20150107 | 1 | 2 |
| 20150108 | 1 | 2 |
could be reduced to
| date | id | x |
| -------- | -- | - |
| 20150101 | 1 | 1 |
| 20150105 | 1 | 2 |
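For anyone who wants to reproduce this, something like the following should build the first example as a DataFrame; the variable names and the local SparkSession setup are just for illustration.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("example").master("local[*]").getOrCreate()
    import spark.implicits._

    // The first example table above, as a DataFrame.
    val df = Seq(
      (20150101, 1, 1), (20150102, 1, 1), (20150103, 1, 1), (20150104, 1, 1),
      (20150105, 1, 2), (20150106, 1, 2), (20150107, 1, 2), (20150108, 1, 2)
    ).toDF("date", "id", "x")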
I originally thought that this function would do what I wanted:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // Keeps only the first (earliest) row for each distinct combination of data columns.
    def filterToUpdates(df: DataFrame): DataFrame = {
      val colsData = df.columns.filter(c => c != "id" && c != "date")
      val window = Window.partitionBy(colsData.map(col): _*).orderBy(col("date").asc)
      df.withColumn("row_num", row_number().over(window))
        .where(col("row_num") === 1).drop("row_num")
    }
But that fails when the data columns change and then change back.
e.g.
| date | id | x |
| -------- | -- | - |
| 20150101 | 1 | 1 |
| 20150102 | 1 | 1 |
| 20150103 | 1 | 1 |
| 20150104 | 1 | 1 |
| 20150105 | 1 | 2 |
| 20150106 | 1 | 2 |
| 20150107 | 1 | 1 |
| 20150108 | 1 | 1 |
would be transformed to
| date | id | x |
| -------- | -- | - |
| 20150101 | 1 | 1 |
| 20150105 | 1 | 2 |
instead of what I want:
| date | id | x |
| -------- | -- | - |
| 20150101 | 1 | 1 |
| 20150105 | 1 | 2 |
| 20150107 | 1 | 1 |
This wouldn't be a difficult task in procedural code that was handed the records row by row in order (partitioned by id and ordered by date), but I just can't see how to phrase it as a Spark computation.
Note: this differs from Spark SQL window function with complex condition. I'm looking to filter down to the rows that differ from the previous row (and drop the no-change rows), which is an additional step that could be done after the construction of the became_inactive column in that question.
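To make that concrete, the shape I have in mind, adapted from the lag-based approach in that question, is roughly the untested sketch below (filterToChanges is just a placeholder name): keep a row whenever any data column differs from the previous row for the same id, or when there is no previous row at all. I'm not sure whether this is the idiomatic way to express it.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag}

    // Rough sketch: keep a row if any data column differs from the previous
    // row for the same id, or if there is no previous row at all.
    def filterToChanges(df: DataFrame): DataFrame = {
      val byId = Window.partitionBy("id").orderBy("date")
      val dataCols = df.columns.filter(c => c != "id" && c != "date")
      // <=> is null-safe equality, so the first row per id (lag is null) counts as a change
      val changed = dataCols
        .map(c => !(lag(col(c), 1).over(byId) <=> col(c)))
        .reduce(_ || _)
      df.filter(changed)
    }

On the second example this should keep 20150101, 20150105, and 20150107, but I'd still like to know whether there's a better way to phrase the computation.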