1

Im new to pyspark dataframes,I have input dataframe in below format :

ID Value
1 CAR1_V1
2 CAR1_V2
3 CAR1_V3
4 CAR2_V1
5 CAR2_V2
6 CAR3_V1
7 CAR3_V1

Now I need to convert it in below format :

ID Value
3 CAR1_V3
5 CAR2_V2
7 CAR3_V1

Not only duplicates need to remove also have to keep the last occurance of values(like data between underscore have to consider separately.

enter image description here

dua
  • 13
  • 4
  • Could you please provider some example data (input and expected output)? – werner Sep 14 '22 at 17:46
  • Please add the data to the question and format it as table. At the moment, it's quite hard to figure out what you are trying to achieve – werner Sep 14 '22 at 17:50
  • [Please do not upload images of code/data/errors when asking a question.](//meta.stackoverflow.com/q/285551) Besides that, its (at least for me) still unclear how the output is calculated. Should the data be parsed? – werner Sep 14 '22 at 18:35
  • Added data in a tabular format, Not only duplicates need to remove also have to keep the last occurance of values (like data between underscore have to consider separately. – dua Sep 15 '22 at 01:50

1 Answers1

0

The result can be achieved in 4 steps:

  • Split the column value in two parts (lines 3-5)
  • Group by the first part of value (line 6)
  • Take the maximum version and the id and value that belongs to this version using max_by (line 7)
  • Optional: Drop the intermediate columns and sort the result (lines 8 and 9)
from pyspark.sql import functions as F

df.withColumn('split', F.split('value', '_'))\
    .withColumn('value_part1', F.col('split')[0]) \
    .withColumn('version', F.col('split')[1]) \
    .groupBy('value_part1') \
    .agg(F.expr('max(version)'), F.expr('max_by(value, version) as value'), F.expr('max_by(id, version) as id')) \
    .select('id', 'value') \
    .orderBy('id') \
    .show()

Output:

+---+-------+
| id|  value|
+---+-------+
|  3|CAR1_V3|
|  5|CAR2_V2|
|  7|CAR3_V2|
+---+-------+

Note: I assume that the version part of value is unique.

werner
  • 13,518
  • 6
  • 30
  • 45