
I want to remove consecutive duplicates within a subset of columns from a dataframe. I found a solution on how to do it, but only for a single column, here.

Having a dataframe like this:

test_df = spark.createDataFrame([
  (2,3.0,"a", "2020-01-01"),
  (2,6.0,"a", "2020-01-02"),
  (3,2.0,"a", "2020-01-02"),
  (4,1.0,"b", "2020-01-04"),
  (4,9.0,"b", "2020-01-05"),
  (4,7.0,"b", "2020-01-05"),
  (2,3.0,"a", "2020-01-08"),
  (4,7.0,"b", "2020-01-09")
], ("id", "num","st", "date"))

##############
id   num   st   date
2    3.0   a    2020-01-01
2    6.0   a    2020-01-02
3    2.0   a    2020-01-02
4    1.0   b    2020-01-04
4    9.0   b    2020-01-05
4    7.0   b    2020-01-05
2    3.0   a    2020-01-08
4    7.0   b    2020-01-09

I want to remove consecutive duplicates in a specific set of columns [id, st], keeping the first record (ordered by date) when consecutive cases appear. If two samples appear on the same day and can't be properly ordered, one can be chosen at random. The result would look like:

##############
id   num   st   date
2    3.0   a    2020-01-01
3    2.0   a    2020-01-02
4    1.0   b    2020-01-04
2    3.0   a    2020-01-08
4    7.0   b    2020-01-09

How could I do that?

  • What is the rule to order the lines? I believe you keep the first record, but how do you define the proper order? – Steven Aug 09 '21 at 12:40
  • @Steven Yes, the first record, sorry I didn't explicitly mention it. Edited the main post. – eljiwo Aug 09 '21 at 12:42
  • This kind of operation has already been answered several times. Here is one: https://stackoverflow.com/questions/68076603/aggregate-over-time-windows-on-a-partitioned-grouped-by-window/68086351#68086351 – Steven Aug 09 '21 at 13:23
  • Does this answer your question? [Aggregate over time windows on a partitioned/grouped by window](https://stackoverflow.com/questions/68076603/aggregate-over-time-windows-on-a-partitioned-grouped-by-window) – Steven Aug 09 '21 at 13:23
  • I'm not able to use your code sample. In theory it seems like it should work for this use case as well, but when creating the ID column all my rows are labeled as id=1. When running the last df4 code, it won't give the expected result. – eljiwo Aug 09 '21 at 13:41
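
For reference, the answer linked in the comments builds running group IDs over a window and keeps one row per group. A minimal sketch of that idea adapted to this dataframe (assuming the test_df defined above; the _change, _block, and _rn column names are only illustrative) could look like this:

from pyspark.sql import Window
import pyspark.sql.functions as f

# Rows are ordered by date; ties on the same date are broken arbitrarily,
# which the question allows.
order_window = Window.orderBy('date')
sum_window = order_window.rowsBetween(Window.unboundedPreceding, Window.currentRow)

grouped_df = (test_df
              # 1 when (id, st) differs from the previous row, 0 otherwise
              .withColumn('_change',
                          (~(f.col('id').eqNullSafe(f.lag('id').over(order_window)) &
                             f.col('st').eqNullSafe(f.lag('st').over(order_window)))).cast('int'))
              # running sum of the change flags gives each consecutive block an id
              .withColumn('_block', f.sum('_change').over(sum_window)))

# keep the earliest row of each consecutive block
result_df = (grouped_df
             .withColumn('_rn', f.row_number().over(Window.partitionBy('_block').orderBy('date')))
             .where(f.col('_rn') == 1)
             .drop('_change', '_block', '_rn'))

Both this sketch and the answer below order only by date, so rows sharing the same date are handled arbitrarily, as the question permits.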

1 Answer


You can concatenate the columns id and st and check whether the previous value is the same as the current one using the lag function.

from pyspark.sql import Window
import pyspark.sql.functions as f

window = Window.orderBy(f.col('date'))
before_dedup_df = (test_df
                   .withColumn('_custom_id', f.concat(f.col('id'), f.col('st')))
                   .withColumn('_consecutive', f.col('_custom_id').eqNullSafe(f.lag('_custom_id').over(window))))
# +---+---+---+----------+----------+------------+
# |id |num|st |date      |_custom_id|_consecutive|
# +---+---+---+----------+----------+------------+
# |2  |3.0|a  |2020-01-01|2a        |false       |
# |2  |6.0|a  |2020-01-02|2a        |true        |
# |3  |2.0|a  |2020-01-02|3a        |false       |
# |4  |1.0|b  |2020-01-04|4b        |false       |
# |4  |9.0|b  |2020-01-05|4b        |true        |
# |4  |7.0|b  |2020-01-05|4b        |true        |
# |2  |3.0|a  |2020-01-08|2a        |false       |
# |4  |7.0|b  |2020-01-09|4b        |false       |
# +---+---+---+----------+----------+------------+

dedup_df = (before_dedup_df
            .where(~f.col('_consecutive'))
            .drop('_custom_id', '_consecutive'))
# +---+---+---+----------+
# |id |num|st |date      |
# +---+---+---+----------+
# |2  |3.0|a  |2020-01-01|
# |3  |2.0|a  |2020-01-02|
# |4  |1.0|b  |2020-01-04|
# |2  |3.0|a  |2020-01-08|
# |4  |7.0|b  |2020-01-09|
# +---+---+---+----------+
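
One caveat with the concatenation trick: distinct (id, st) pairs can in principle produce the same concatenated key (for example id=2 with st="2a" versus id=22 with st="a"), and the unpartitioned Window.orderBy moves all rows into a single partition. If the collision risk matters, a sketch of the same lag-based idea that compares the subset columns directly (assuming the test_df above) would be:

from pyspark.sql import Window
import pyspark.sql.functions as f

window = Window.orderBy(f.col('date'))

# Flag a row as consecutive only when every column in the subset matches the
# previous row; eqNullSafe evaluates to false on the first row, where lag is null.
dedup_df = (test_df
            .withColumn('_consecutive',
                        f.col('id').eqNullSafe(f.lag('id').over(window)) &
                        f.col('st').eqNullSafe(f.lag('st').over(window)))
            .where(~f.col('_consecutive'))
            .drop('_consecutive'))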