
Dataframe: (screenshot not included)

Above is my dataframe. I want to add a new column whose value is 1 if the first transaction_date for an item is after 01.01.2022, else 0. To do this I use the window/partition code below:

from pyspark.sql import Window
from pyspark.sql import functions as f

windowSpec = Window.partitionBy("article_id").orderBy("transaction_date")

feature_grid = (
    feature_grid
    .withColumn("row_number", f.row_number().over(windowSpec))
    .withColumn("new_item",
                f.when((f.col("row_number") == 1)
                       & (f.col("transaction_date") >= f.lit("2022-01-01")), 1)
                 .otherwise(0))
    .drop("row_number")
)

I want to perform clustering on the dataframe, for which I am using VectorAssembler with the below code:

from pyspark.ml.feature import VectorAssembler

input_cols = feature_grid.columns
assemble = VectorAssembler(inputCols=input_cols, outputCol='features')
assembled_data = assemble.transform(feature_grid)

For standardisation:

from pyspark.ml.feature import StandardScaler

scale = StandardScaler(inputCol='features', outputCol='standardized')
data_scale = scale.fit(assembled_data)
data_scale_output = data_scale.transform(assembled_data)
display(data_scale_output)

The standardisation chunk gives me the error below, but only when I use the partitioning method above; without that partitioning method, the code works fine.

Error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 182.0 failed 4 times, most recent failure: Lost task 0.3 in stage 182.0 (TID 3635) (10.205.234.124 executor 1): org.apache.spark.SparkException: Failed to execute user defined function (VectorAssembler$$Lambda$3621/907379691

Can someone tell me what I am doing wrong here, or what the cause of the error is?

  • `org.apache.spark.SparkException: Failed to execute user defined function` do you have any UDF when you transform your `feature_grid` dataframe? Did you do any action and caching before you standardize your `feature_grid`? – Jonathan Lam Aug 23 '22 at 16:41
  • @Jonathan There are no UDF or caching done, when transforming the feature grid, only some basic for loops and data manipulation like dropping columns, type casting and join operations are done. – snigdha mohapatra Aug 23 '22 at 17:03
  • As you're just using a typical Spark ML pipeline, could you try to cache and call an action like `show()` or `first()` before you transform your dataframe with your pipeline? – Jonathan Lam Aug 24 '22 at 13:28
  • To my understanding, `VectorAssembler` works only for numeric columns. Your `input_cols` looks like it is taking Article and Transaction Date too. – s510 Aug 24 '22 at 14:48
  • @Jonathan, Do you mean that I should perform a feature_grid.cache(), feature_grid.show() after the partition operation. I did that and still the error occurs. – snigdha mohapatra Aug 25 '22 at 10:29
  • @derFotik No, only numeric cols are considered for VectorAssembler; other columns like article_id and transaction_date were dropped before VectorAssembler was executed. – snigdha mohapatra Aug 25 '22 at 10:31
  • @snigdhamohapatra which spark version are you using? If you're using old spark version, this error may be caused by null value in the column that you want to assemble. Please check if there is any null row / column. – Jonathan Lam Aug 26 '22 at 09:04
  • @JonathanLam I am using Spark 3.2.1, and yes, it was due to null values in some columns. The error got resolved. Thank you for your help and all suggestions. – snigdha mohapatra Aug 29 '22 at 21:18
  • @snigdhamohapatra Great to hear that you solved the problem. I have updated the answer below, please accept it if you think the answer helps :) – Jonathan Lam Aug 30 '22 at 03:39

1 Answer


This error is triggered by null values in the columns that are assembled by Spark's VectorAssembler. Please fill the nulls before transforming your dataframe.

Jonathan Lam