I have a pandas DataFrame as follows:
---------------
 id | name
---------------
 1  | joe
 1  | john
 2  | jane
 3  | jo
---------------
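For reproducibility (assuming 'id' is an integer column), the frame can be built like this:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3],
                   'name': ['joe', 'john', 'jane', 'jo']})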
The goal is: if an 'id' value is duplicated, append an ascending number to it starting from 1, so the first occurrence keeps its original value and each later duplicate gets a suffix.
In Pandas, I can do it this way:
count_id = df.groupby(['id']).cumcount()         # 0, 1, 2, ... within each id group
count_num = count_id.replace(0, '').astype(str)  # first occurrence gets an empty suffix
df['id'] = df['id'].astype(str) + count_num      # cast 'id' to str before concatenating
I tried to translate the same logic to PySpark with no success; my attempt is sketched below.
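For reference, this is roughly what I attempted (the window spec is my guess at a cumcount equivalent; F.count over a window partitioned by 'id' returns the group size on every row, not a running counter):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(1, 'joe'), (1, 'john'), (2, 'jane'), (3, 'jo')],
    ['id', 'name'])

# Attempt: count over the 'id' partition. This puts the total group size
# (2 for id=1) on every row instead of a 0, 1, 2, ... cumulative counter,
# so both duplicates would end up with the same suffix.
w = Window.partitionBy('id')
sdf = sdf.withColumn('count_id', F.count('id').over(w))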
The result should be:
---------------
 id | name
---------------
 1  | joe
 11 | john
 2  | jane
 3  | jo
---------------
How do I achieve the same in PySpark? Any help is greatly appreciated.