I want to remove consecutive duplicates, judged on a subset of columns, from a PySpark dataframe. I found a solution for this here, but it only works for a single column.
Given a dataframe like this:
test_df = spark.createDataFrame([
    (2, 3.0, "a", "2020-01-01"),
    (2, 6.0, "a", "2020-01-02"),
    (3, 2.0, "a", "2020-01-02"),
    (4, 1.0, "b", "2020-01-04"),
    (4, 9.0, "b", "2020-01-05"),
    (4, 7.0, "b", "2020-01-05"),
    (2, 3.0, "a", "2020-01-08"),
    (4, 7.0, "b", "2020-01-09"),
], ("id", "num", "st", "date"))
##############
id num st date
2, 3.0, "a", "2020-01-01"
2, 6.0, "a", "2020-01-02"
3, 2.0, "a", "2020-01-02"
4, 1.0, "b", "2020-01-04"
4, 9.0, "b", "2020-01-05"
4, 7.0, "b", "2020-01-05"
2, 3.0, "a", "2020-01-08"
4, 7.0, "b", "2020-01-09"
I want to remove consecutive duplicates over the specific set of columns [id, st], keeping the first record (ordered by date) whenever consecutive cases appear. If two samples fall on the same day and cannot be properly ordered, either one can be chosen at random. The result would look like:
##############
id num st date
2, 3.0, "a", "2020-01-01"
3, 2.0, "a", "2020-01-02"
4, 1.0, "b", "2020-01-04"
2, 3.0, "a", "2020-01-08"
4, 7.0, "b", "2020-01-09"
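To pin down exactly what I mean by "consecutive duplicates", here is the same deduplication written in plain Python over the rows. This is only a reference sketch of the intended semantics, not the Spark solution I am after:

```python
rows = [
    (2, 3.0, "a", "2020-01-01"),
    (2, 6.0, "a", "2020-01-02"),
    (3, 2.0, "a", "2020-01-02"),
    (4, 1.0, "b", "2020-01-04"),
    (4, 9.0, "b", "2020-01-05"),
    (4, 7.0, "b", "2020-01-05"),
    (2, 3.0, "a", "2020-01-08"),
    (4, 7.0, "b", "2020-01-09"),
]

# Stable sort by date: same-day rows keep their input order, which is
# acceptable per the problem statement (ties may be resolved arbitrarily).
ordered = sorted(rows, key=lambda r: r[3])

# Keep a row only when its (id, st) pair differs from the pair of the
# immediately preceding row.
result = []
prev_key = None
for r in ordered:
    key = (r[0], r[2])  # (id, st)
    if key != prev_key:
        result.append(r)
    prev_key = key

for r in result:
    print(r)
```

Running this produces exactly the five rows shown above.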
How could I do that?