
I have the following PySpark dataframe. I need to create column Value2 from column Value1: for all the rows with the same value of Time, the value in Value2 needs to be the first value that appears for that time in Value1. For example, for all the rows where the time is 16:07 the value needs to be 100. I know I need to use some form of conditional, but I could not work out how to do it, so I was wondering if someone could point me in the right direction.

    import pandas as pd

    # sample data; all three lists must have the same length (16 rows)
    data = {
        'Name': ['John', 'Kate', 'William', 'Andrew', 'Nicole', 'Nicola', 'Gavin', 'Gabriel',
                 'John', 'Kate', 'William', 'Andrew', 'Nicole', 'Nicola', 'Gavin', 'Gabriel'],
        'Value1': [10, 20, 50, 60, 100, 200, 250, 300, 450, 520, 570, 600, 630, 670, 690, 700],
        'Time': ['15/06/2020 16:05:00', '15/06/2020 16:05:00', '15/06/2020 16:05:00', '15/06/2020 16:06:00',
                 '15/06/2020 16:07:00', '15/06/2020 16:07:00', '15/06/2020 16:08:00', '15/06/2020 16:09:00',
                 '15/06/2020 16:10:00', '15/06/2020 17:20:00', '15/06/2020 17:21:00', '15/06/2020 17:22:00',
                 '15/06/2020 17:22:00', '15/06/2020 17:22:00', '15/06/2020 17:22:00', '15/06/2020 17:25:00'],
    }
    df = pd.DataFrame(data)
    df_spark = spark.createDataFrame(df)



2 Answers


You have two options:

  1. Use a window function to extract the first/minimum value of Value1 within each Time group. Note: if this takes too long, use option 2. Window functions are not performant in general and should be avoided on large dataframes.
  2. Create an aggregate table and join it back onto your dataframe. This is the more performant method if your dataset is large, but it requires two steps instead of one. The query for the aggregate table could look something like this: `spark.sql("SELECT Time, MIN(Value1) AS Value2 FROM Table GROUP BY Time").createOrReplaceTempView("Aggregate_Table")`. The query for your final table would then be: `spark.sql("SELECT a.*, b.Value2 FROM Table a INNER JOIN Aggregate_Table b ON a.Time = b.Time")`. A runnable sketch of this sequence is shown below the list.
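
A minimal sketch of option 2, assuming the DataFrame from the question is registered as the temp view `Table` (the view names here are only illustrative):

    # register the question's DataFrame so the SQL below can see it
    df_spark.createOrReplaceTempView("Table")

    # step 1: per-Time minimum of Value1, kept as its own view
    spark.sql("""
        SELECT Time, MIN(Value1) AS Value2
        FROM Table
        GROUP BY Time
    """).createOrReplaceTempView("Aggregate_Table")

    # step 2: join the aggregate back onto the original rows
    result = spark.sql("""
        SELECT a.*, b.Value2
        FROM Table a
        INNER JOIN Aggregate_Table b ON a.Time = b.Time
    """)
    result.show()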
Arjoon
  • Hi @Arjoon, can you share some external resource/reference on `Window functions are not performant in general and should be avoided on large dataframes`? – A.B Oct 15 '20 at 14:24
  • Couldn't find a quick and straightforward reference but I'll try to explain here. Without any partitions on the dataframe, a window function would cause a shuffle in the data. In large dataframes, this can be a memory intensive task that may take a long time depending on the overall Spark query plan and resources available. If a dataframe is partitioned by the same column that the PARTITION clause calls in the window function, this task would be performant as described here: https://stackoverflow.com/questions/54332942/spark-windowing-function-vs-group-by-performance-issue – Arjoon Oct 15 '20 at 15:29
  • Thank you for explaining :) I wanted to confirm that if partitioned properly, windows aren't that bad, which is confirmed by the answer you shared as well. – A.B Oct 15 '20 at 17:06
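
To make the point from the comments above concrete, here is a minimal sketch (reusing `df_spark` and the `Time` column from the question): if the dataframe is repartitioned by the same column the window partitions on, the shuffle happens once up front and the window computation itself does not add another one:

    from pyspark.sql import Window
    import pyspark.sql.functions as F

    # repartition() shuffles once by Time; subsequent window functions that
    # partition by Time can reuse this partitioning instead of shuffling again
    df_partitioned = df_spark.repartition('Time')

    window = Window.partitionBy('Time')
    df_out = df_partitioned.withColumn('Value2', F.min('Value1').over(window))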

Try a window function. You get the minimum Value1 within the window defined by the column Time:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # every row gets the minimum Value1 of its Time group
    window = Window.partitionBy('Time')
    df_spark = df_spark.withColumn('Value2', F.min('Value1').over(window))
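
If you need the literal first value per Time rather than the minimum (the two coincide in the sample data because Value1 is increasing), a variant is `first()` over an ordered window. The ordering column below is an assumption, since the sample data has no separate column that records row order:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # ordering by Value1 is a stand-in for whatever defines "first" in your data
    ordered_window = Window.partitionBy('Time').orderBy('Value1')
    df_spark = df_spark.withColumn('Value2', F.first('Value1').over(ordered_window))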
Nico Arbar