Below I have a PySpark DataFrame (test), a function (func) and a window definition. I want to turn func into a pandas_udf and apply it over the window I define below.
func takes a list of values, and I want that list to be the four 'value' entries of the window. However, I can't find a way to do this.
Hope anyone here can point me in the right direction.
Also, here: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.pandas_udf.html it states: "Currently, pyspark.sql.types.ArrayType of pyspark.sql.types.TimestampType and nested pyspark.sql.types.StructType are currently not supported as output types." Perhaps that means this isn't possible at all. In that case, what alternatives does a developer have?
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from typing import Iterator
from pyspark.sql.types import ArrayType, IntegerType, LongType, DoubleType, FloatType, StringType
from pyspark.sql.window import Window
import pyspark.sql.functions as F
# Create a Spark session
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()
# The function I wish to apply on a window
def func(list_of_values):
    # Cap every value in the list at 0.25
    for i in range(len(list_of_values)):
        if list_of_values[i] > 0.25:
            list_of_values[i] = 0.25
    return list_of_values
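# For example, on a plain Python list holding the four 'value' entries of the
# USA / 2022-02-01 rows, func caps anything above 0.25:
print(func([0.1, 0.2, 0.6, 0.1]))  # -> [0.1, 0.2, 0.25, 0.1]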
# Sample dataframe
test = spark.createDataFrame([("USA", '2022-02-01', 'A', 0.1),
("USA", '2022-02-01', 'B', 0.2),
("USA", '2022-02-01', 'C', 0.6),
("USA", '2022-02-01', 'D', 0.1),
("USA", '2022-03-01', 'A', 0.25),
("USA", '2022-03-01', 'B', 0.25),
("USA", '2022-03-01', 'C', 0.25),
("USA", '2022-03-01', 'D', 0.25),
("France", '2022-02-01', 'A', 0.1),
("France", '2022-02-01', 'B', 0.15),
("France", '2022-02-01', 'C', 0.55),
("France", '2022-02-01', 'D', 0.2),
("France", '2022-03-01', 'A', 0.2),
("France", '2022-03-01', 'B', 0.36),
("France", '2022-03-01', 'C', 0.14),
("France", '2022-03-01', 'D', 0.3),
],
("country", "date", "class", 'value'))
window = Window.partitionBy('country', 'date', 'class')
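To make the goal concrete, the result I'm after looks like the following. Here I build it with a plain F.least expression purely as an illustration (and I'm assuming I want the capped value back on every row); my question is how to produce this by feeding the window's four 'value' entries into func as a pandas_udf:

# Target output, built with F.least only to show what I expect; the actual goal
# is to get here by applying func as a pandas_udf over `window`.
expected = test.withColumn('value_capped', F.least(F.col('value'), F.lit(0.25)))
expected.show()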