
The question is in the title, but how do you create a new column in PySpark that cumulatively counts the number of consecutive repeated values?

For instance:

| Value |
|-------|
| 0     |
| 0     |
| 5     |
| 5     |
| -1    |
| 0     |
| 0     |
| 0     |

Applying this to the Value column would produce the following Result column:

| Value | Result |
|-------|--------|
| 0     | 1      |
| 0     | 2      |
| 5     | 1      |
| 5     | 2      |
| -1    | 1      |
| 0     | 1      |
| 0     | 2      |
| 0     | 3      |
pppery
Jimmy

1 Answer


One solution is to use the difference between row numbers to create groups of consecutive identical values (the "gaps-and-islands" technique), then compute row_number() within each group as the result:

from pyspark.sql import functions as F, Window

df = spark.createDataFrame([(0,), (0,), (5,), (5,), (-1,), (0,), (0,), (0,)], ["Value"])

df.withColumn("ID", F.monotonically_increasing_id()) \
    .withColumn("group",
            # the difference between the global row number and the
            # per-value row number is constant within each run of
            # identical values; ordering by ID keeps it deterministic
            F.row_number().over(Window.orderBy("ID"))
            - F.row_number().over(Window.partitionBy("Value").orderBy("ID"))
    ) \
    .withColumn("Result", F.row_number().over(Window.partitionBy("group").orderBy("ID"))) \
    .drop("ID", "group") \
    .show()

#+-----+------+
#|Value|Result|
#+-----+------+
#|    0|     1|
#|    0|     2|
#|    5|     1|
#|    5|     2|
#|    0|     1|
#|    0|     2|
#|    0|     3|
#|   -1|     1|
#+-----+------+
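To see why the difference-of-row-numbers trick works, here is a plain-Python walkthrough of the same logic (a sketch for illustration only, no Spark needed):

```python
values = [0, 0, 5, 5, -1, 0, 0, 0]

# Global row number, in input order.
global_rn = list(range(1, len(values) + 1))

# Per-value row number: how many times each value has appeared so far.
seen = {}
per_value_rn = []
for v in values:
    seen[v] = seen.get(v, 0) + 1
    per_value_rn.append(seen[v])

# Within a run of identical values both counters advance by 1 per row,
# so their difference is constant for the whole run.
groups = [g - p for g, p in zip(global_rn, per_value_rn)]

# Numbering rows within each (value, group) island gives the result.
counts = {}
result = []
for v, g in zip(values, groups):
    counts[(v, g)] = counts.get((v, g), 0) + 1
    result.append(counts[(v, g)])

print(result)  # [1, 2, 1, 2, 1, 1, 2, 3]
```

Note that two runs of *different* values can coincidentally share the same group number, which is why this sketch keys the final count on the (value, group) pair rather than the group alone.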
blackbishop
  • Two things. First, this reorders the list where the -1 comes in last. Is there a way to do this without having to resort it? Secondly, the actual list does not have an ID column, I regrettably just added it to make the example have multiple columns. Creating an ID or index column with "monotonicallyIncreasingId" does not create a sequential one because of the partitions. Is there a way to do this without the ID column? – Jimmy Dec 16 '19 at 18:28
  • You are a genius – Jimmy Dec 18 '19 at 20:36
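For readers with the same concern as the first comment: a common order-preserving variant flags each row where the value changes with a lag, then takes a running sum of the flags as the group id (in PySpark this would be F.lag and F.sum over a window ordered by a sequential id; a reliable ordering column is still required). A plain-Python sketch of the idea:

```python
values = [0, 0, 5, 5, -1, 0, 0, 0]

prev = object()   # sentinel that differs from every value (like a null lag)
group_id = 0      # running sum of "value changed" flags
run_len = 0
groups, result = [], []
for v in values:
    if v != prev:      # change flag == 1
        group_id += 1
        run_len = 0
    run_len += 1
    groups.append(group_id)
    result.append(run_len)
    prev = v

print(groups)  # [1, 1, 2, 2, 3, 4, 4, 4]
print(result)  # [1, 2, 1, 2, 1, 1, 2, 3]
```

Because the group ids increase monotonically with the input order, no re-sort of the data is needed to recover the original sequence.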