
I am trying to implement something similar to the SparkR code below in PySpark.

df <- createDataFrame(mtcars)
# Partition by am (transmission) and order by hp (horsepower)
ws <- orderBy(windowPartitionBy("am"), "hp")
# Lag mpg values by 1 row on the partition-and-ordered table
out <- select(df, over(lag(df$mpg), ws), df$mpg, df$hp, df$am)

Does anyone have any idea how to do this on a PySpark dataframe?

  • check this out: https://stackoverflow.com/questions/31857863/how-to-use-window-functions-in-pyspark and https://sparkbyexamples.com/pyspark/pyspark-window-functions/ – Nikunj Kakadiya Dec 10 '21 at 11:20

1 Answer

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import lag

spark = SparkSession.builder.getOrCreate()

# Create a sample dataframe
data = [("A", 10), ("B", 20), ("A", 30), ("C", 15)]
columns = ["Name", "Number"]
df = spark.createDataFrame(data, columns)

# Define the window: partition by Name, order by Number
win = Window.partitionBy("Name").orderBy("Number")

# lag(col, offset, default): the default (-5 here) is returned
# for the first row of each partition, which has no previous row
df_lag = df.withColumn("lag", lag("Number", 1, -5).over(win))