2

I have a pyspark dataframe with 3 columns:

ID, each appearing multiple times; DATE; DELAY, 0 if this bill was payed on time, 1 otherwise.

It's already ordered by ID and DATE. I need to create a column named CONSECUTIVE that shows how many consecutive bills were paid consecutively with DELAY=1 for each ID.

Example of data, and expected result:

ID    | DATE  | DELAY  | CONSECUTIVE
101   | 1     | 1      | 1
101   | 2     | 1      | 2
101   | 3     | 1      | 3
101   | 4     | 0      | 0
101   | 5     | 1      | 1
101   | 6     | 1      | 2
213   | 1     | 1      | 1
213   | 2     | 1      | 2

Is there a way to do it without using Pandas? If so, how do I do it?

mccintra
  • 21
  • 2
  • 1
    Possible duplicate of [Python Spark Cumulative Sum by Group Using DataFrame](https://stackoverflow.com/questions/45946349/python-spark-cumulative-sum-by-group-using-dataframe) – pault Jul 11 '18 at 20:06
  • See the dupe target- you're looking for something like `df.withColumn('CONSECUTIVE', F.sum('DELAY').over(Window.partitionBy('ID').orderBy('DATE').rangeBetween(Window.unboundedPreceding, 0))` – pault Jul 11 '18 at 20:08

1 Answers1

0

You can do this with 3 transformation with help of window.

from pyspark.sql.window import Window
from pyspark.sql import functions as F

df = sqlContext.createDataFrame([
    (101, 1, 1),
    (101, 2, 1), # dasd
    (101, 3, 0),
    (101, 4, 1)
], ["id", 'date', 'delay'])

window = Window.partitionBy('id').orderBy('date')
last_value = F.last('rank').over(window.rowsBetween(-2, -1))
consecutive = F.when( F.col('delay')==0, 0) \
            .otherwise( F.when(F.col('last_rank').isNull(), 1) \
            .otherwise( F.col('last_rank')+1))

df \
    .withColumn('rank', F.row_number().over(window)) \
    .withColumn('rank', F.when(F.col('delay')!=0, F.col('rank')).otherwise(0)) \
    .withColumn('last_rank', last_value) \
    .withColumn('consecutive', consecutive).show()

results:

+---+----+-----+----+---------+-----------+
| id|date|delay|rank|last_rank|consecutive|
+---+----+-----+----+---------+-----------+
|101|   1|    1|   1|     null|          1|
|101|   1|    1|   2|        1|          2|
|101|   1|    0|   0|        2|          0|
|101|   1|    1|   4|        0|          1|
+---+----+-----+----+---------+-----------+
hamza tuna
  • 1,467
  • 1
  • 12
  • 17