
Below I have a PySpark DataFrame (test), a function (func), and a window specification (window). I would like to turn func into a pandas_udf and apply it over the window defined below.

func takes a list of values. I want that list to be the four values of the 'value' column that fall within each window. However, I can't find a way to do this.

I hope someone here can point me in the right direction.

Also, the documentation at https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.pandas_udf.html states: "Currently, pyspark.sql.types.ArrayType of pyspark.sql.types.TimestampType and nested pyspark.sql.types.StructType are currently not supported as output types." Perhaps that means this isn't possible. In that case, what alternatives does a developer have?


from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from typing import Iterator
from pyspark.sql.types import ArrayType, IntegerType, LongType, DoubleType, FloatType, StringType
from pyspark.sql.window import Window
import pyspark.sql.functions as F


# Create a Spark session
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()

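# The function I wish to apply to the values of each window
def func(list_of_values):
    # Cap every value above 0.25 at 0.25 (modifies the list in place)
    for i in range(len(list_of_values)):
        if list_of_values[i] > 0.25:
            list_of_values[i] = 0.25
    return list_of_values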

# Sample DataFrame
test = spark.createDataFrame(
    [
        ("USA", '2022-02-01', 'A', 0.1),
        ("USA", '2022-02-01', 'B', 0.2),
        ("USA", '2022-02-01', 'C', 0.6),
        ("USA", '2022-02-01', 'D', 0.1),
        ("USA", '2022-03-01', 'A', 0.25),
        ("USA", '2022-03-01', 'B', 0.25),
        ("USA", '2022-03-01', 'C', 0.25),
        ("USA", '2022-03-01', 'D', 0.25),
        ("France", '2022-02-01', 'A', 0.1),
        ("France", '2022-02-01', 'B', 0.15),
        ("France", '2022-02-01', 'C', 0.55),
        ("France", '2022-02-01', 'D', 0.2),
        ("France", '2022-03-01', 'A', 0.2),
        ("France", '2022-03-01', 'B', 0.36),
        ("France", '2022-03-01', 'C', 0.14),
        ("France", '2022-03-01', 'D', 0.3),
    ],
    ("country", "date", "class", 'value'),
)

# Partition by country and date only, so each window holds the four class rows (A-D)
window = Window.partitionBy('country', 'date')
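
One direction that might work, given here only as a sketch: instead of a window, group the rows and hand each group to a pandas function via `groupBy(...).applyInPandas(...)`. The block below assumes that the intended grouping is country and date (four rows per group) and that capping at 0.25 is all func needs to do. The names `cap_group`, `CAP` and `value_capped` are made up for illustration. The pandas part runs on its own; the Spark call is left as a comment because it has not been run against the DataFrame above.

```python
import pandas as pd

CAP = 0.25  # upper bound taken from func in the question


def cap_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receive all rows of one (country, date) group and cap 'value' at CAP."""
    out = pdf.copy()  # leave the input group untouched
    # Same rule as func, applied to the whole group at once.
    out["value_capped"] = out["value"].clip(upper=CAP)
    return out


# Local check on one group from the sample data (USA, 2022-02-01).
group = pd.DataFrame({
    "country": ["USA"] * 4,
    "date": ["2022-02-01"] * 4,
    "class": ["A", "B", "C", "D"],
    "value": [0.1, 0.2, 0.6, 0.1],
})
result = cap_group(group)
print(result["value_capped"].tolist())

# Assumed Spark usage (not run here), grouping instead of using a window:
# schema = "country string, date string, class string, value double, value_capped double"
# capped = test.groupBy("country", "date").applyInPandas(cap_group, schema=schema)
```

If the function really only caps each value independently, the grouping is not needed at all and a plain column expression such as `F.least(F.col('value'), F.lit(0.25))` would likely be simpler. The grouped approach only pays off when the function needs to see all four values together.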