The result can be achieved in 4 steps:
- Split the column `value` in two parts (lines 3-5)
- Group by the first part of `value` (line 6)
- Take the maximum version and the `id` and `value` that belong to this version using `max_by` (line 7)
- Optional: drop the intermediate columns and sort the result (lines 8 and 9)
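For reference, the snippet below assumes an input DataFrame `df` with columns `id` and `value`. A minimal sketch of such a DataFrame (hypothetical data, chosen only to be consistent with the output shown further down) could be built like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: three "cars", each with one or more versions
df = spark.createDataFrame(
    [(1, 'CAR1_V1'), (2, 'CAR1_V2'), (3, 'CAR1_V3'),
     (4, 'CAR2_V1'), (5, 'CAR2_V2'),
     (6, 'CAR3_V1'), (7, 'CAR3_V2')],
    ['id', 'value'])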
from pyspark.sql import functions as F

df.withColumn('split', F.split('value', '_')) \
    .withColumn('value_part1', F.col('split')[0]) \
    .withColumn('version', F.col('split')[1]) \
    .groupBy('value_part1') \
    .agg(F.expr('max(version)'), F.expr('max_by(value, version) as value'), F.expr('max_by(id, version) as id')) \
    .select('id', 'value') \
    .orderBy('id') \
    .show()
Output:
+---+-------+
| id| value|
+---+-------+
| 3|CAR1_V3|
| 5|CAR2_V2|
| 7|CAR3_V2|
+---+-------+
Note: I assume that the version part of `value` is unique within each group, so `max_by` picks a single unambiguous row.
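`max_by` requires Spark 3.0 or newer. On older versions, a rough equivalent (a sketch reusing the same column names as above) is to take the maximum of a struct whose first field is the version and then unpack the winning row:

# Sketch for Spark < 3.0: max over a struct compares fields left to right,
# so putting 'version' first selects the row with the highest version.
df.withColumn('split', F.split('value', '_')) \
    .withColumn('value_part1', F.col('split')[0]) \
    .withColumn('version', F.col('split')[1]) \
    .groupBy('value_part1') \
    .agg(F.max(F.struct('version', 'id', 'value')).alias('best')) \
    .select(F.col('best.id').alias('id'), F.col('best.value').alias('value')) \
    .orderBy('id') \
    .show()

Because the struct is compared field by field, ties on version are broken by `id`, which also makes the result deterministic.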