0

I have a dataframe as follows:

---------------
id   | name   |
---------------
 1   | joe    |
 1   | john   |
 2   | jane   |
 3   | jo     |
---------------

The goal is, if the 'id' column is duplicate, add ascending number to it starting from 1.

In Pandas, I can do it this way:

count_id = df.groupby(['id']).cumcount()
count_num = count_id.replace(0, '').astype(str)
df['id'] += count_num

I tried to use the same logic in PySpark with no success.

The result should be:


id   | name   |
---------------
 1   | joe    |
 11  | john   |
 2   | jane   |
 3   | jo     |
---------------

How do I achieve the same in PySpark? Any help is greatly appreciated.

Raj
  • 613
  • 3
  • 9
  • 23

1 Answers1

1

To replicate that output, you can use a Window to get the row_number for each id, and then concat to add it to the id.

import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy("id").orderBy("name")
df.withColumn("row_number", f.row_number().over(w)-1)\
    .withColumn(
        "id", 
        f.when(
            f.col("row_number") > 0, 
            f.concat(f.col("id"), f.col("row_number"))
        ).otherwise(f.col("id"))
    )\
    .drop("row_number")\
    .show()
#+---+----+
#| id|name|
#+---+----+
#|  1| joe|
#| 11|john|
#|  3|  jo|
#|  2|jane|
#+---+----+

Note: This will convert the id column into a StringType column if it isn't already.


In order to get the output you originally stated in the question as the desired result, you'd have to add a group count column in addition to calculating the row number. Only concatenate the row number if the count is greater than one.

import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy("id")
df.withColumn("count", f.count("*").over(w))\
    .withColumn("row_number", f.row_number().over(w.orderBy("name")))\
    .withColumn(
        "id", 
        f.when(
            f.col("count") > 1, 
            f.concat(f.col("id"), f.col("row_number"))
        ).otherwise(f.col("id"))
    )\
    .drop("count", "row_number")\
    .show()
#+---+----+
#| id|name|
#+---+----+
#| 11| joe|
#| 12|john|
#|  3|  jo|
#|  2|jane|
#+---+----+
pault
  • 41,343
  • 15
  • 107
  • 149