
I am trying to add a new column to my Spark DataFrame. I understand the following code can be used:

df.withColumn("row", monotonically_increasing_id())

But my use case is:

Input DF:

col_value
  1
  2
  2
  3
  3
  3

Output DF:

col_value      identifier
  1               1
  2               1
  2               2
  3               1
  3               2
  3               3

Any suggestions on getting this with monotonically_increasing_id or rowWithUniqueIndex?

data_person

1 Answer


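monotonically_increasing_id generates IDs that are unique across the whole DataFrame, so it cannot restart at 1 for each value. Given your requirement, one approach would be to use the row_number window function over a window partitioned by col_value: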

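import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// Outside spark-shell, toDF also needs: import spark.implicits._
// (assumes a SparkSession named `spark` is in scope)

val df = Seq(
  1, 2, 2, 3, 3, 3
).toDF("col_value")

// Number the rows within each group of equal col_value, starting at 1
val window = Window.partitionBy("col_value").orderBy("col_value")
df.withColumn("identifier", row_number().over(window)).
  orderBy("col_value", "identifier").  // sort by both columns so the display order is deterministic
  show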
// +---------+----------+
// |col_value|identifier|
// +---------+----------+
// |        1|         1|
// |        2|         1|
// |        2|         2|
// |        3|         1|
// |        3|         2|
// |        3|         3|
// +---------+----------+
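
To make it clearer what row_number over a partition computes, here is a minimal plain-Scala sketch of the same per-group numbering on an ordinary collection, with no Spark involved. The names RowNumberSketch and numberWithinGroups are hypothetical and only for illustration; this is not how Spark implements it, just the same result on a small in-memory Seq.

```scala
object RowNumberSketch {
  // Hypothetical helper: pairs each value with a counter that restarts
  // at 1 for every distinct value, preserving the input order.
  def numberWithinGroups(values: Seq[Int]): Seq[(Int, Int)] = {
    // Running count of how many times each value has been seen so far.
    val seen = scala.collection.mutable.Map.empty[Int, Int].withDefaultValue(0)
    values.map { v =>
      seen(v) += 1
      (v, seen(v))
    }
  }

  def main(args: Array[String]): Unit = {
    // Same input as the DataFrame above; prints one "col_value identifier" pair per line.
    numberWithinGroups(Seq(1, 2, 2, 3, 3, 3)).foreach {
      case (value, id) => println(s"$value\t$id")
    }
  }
}
```

Unlike the Spark version, this runs on a single machine and relies on the input order, so it is only useful for understanding the expected output, not as a replacement for the window function on a distributed DataFrame.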
Leo C