The result can be achieved in 4 steps:
- Split the column `value` in two parts (lines 3-5)
- Group by the first part of `value` (line 6)
- Take the maximum version and the `id` and `value` that belong to this version using `max_by` (line 7)
- Optional: drop the intermediate columns and sort the result (lines 8 and 9)
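For reference, the snippet below assumes an input DataFrame `df` with columns `id` and `value`. A minimal sketch of such a DataFrame (hypothetical data, chosen only to be consistent with the output shown further down) could be built like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: three "cars", each with one or more versions
df = spark.createDataFrame(
    [(1, 'CAR1_V1'), (2, 'CAR1_V2'), (3, 'CAR1_V3'),
     (4, 'CAR2_V1'), (5, 'CAR2_V2'),
     (6, 'CAR3_V1'), (7, 'CAR3_V2')],
    ['id', 'value'])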
from pyspark.sql import functions as F

df.withColumn('split', F.split('value', '_')) \
    .withColumn('value_part1', F.col('split')[0]) \
    .withColumn('version', F.col('split')[1]) \
    .groupBy('value_part1') \
    .agg(F.expr('max(version)'), F.expr('max_by(value, version) as value'), F.expr('max_by(id, version) as id')) \
    .select('id', 'value') \
    .orderBy('id') \
    .show()
Output:
+---+-------+
| id| value|
+---+-------+
| 3|CAR1_V3|
| 5|CAR2_V2|
| 7|CAR3_V2|
+---+-------+
Note: I assume that the version part of `value` is unique within each group, so `max_by` picks a single unambiguous row.
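`max_by` requires Spark 3.0 or newer. On older versions, a rough equivalent (a sketch reusing the same column names as above) is to take the maximum of a struct whose first field is the version and then unpack the winning row:

# Sketch for Spark < 3.0: max over a struct compares fields left to right,
# so putting 'version' first selects the row with the highest version.
df.withColumn('split', F.split('value', '_')) \
    .withColumn('value_part1', F.col('split')[0]) \
    .withColumn('version', F.col('split')[1]) \
    .groupBy('value_part1') \
    .agg(F.max(F.struct('version', 'id', 'value')).alias('best')) \
    .select(F.col('best.id').alias('id'), F.col('best.value').alias('value')) \
    .orderBy('id') \
    .show()

Because the struct is compared field by field, ties on version are broken by `id`, which also makes the result deterministic.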