I want to apply pyspark.ml.feature.StandardScaler over a window of my data.
df4 = spark.createDataFrame(
    [
        (1, 1, 'X', 'a'),
        (2, 1, 'X', 'a'),
        (3, 9, 'X', 'b'),
        (5, 1, 'X', 'b'),
        (6, 2, 'X', 'c'),
        (7, 2, 'X', 'c'),
        (8, 10, 'Y', 'a'),
        (9, 45, 'Y', 'a'),
        (10, 3, 'Y', 'a'),
        (11, 3, 'Y', 'b'),
        (12, 6, 'Y', 'b'),
        (13, 19, 'Y', 'b')
    ],
    ['id', 'feature', 'txt', 'cat']
)
w = Window.partitionBy(..)
I can do this over the whole dataframe by calling the .fit and .transform methods (a rough sketch of that is below), but not on the w variable, which is normally used in expressions like F.col('feature') - F.mean('feature').over(w).
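For reference, the whole-dataframe version looks roughly like this (a sketch only; the feature_vec and feature_scaled column names are just placeholders, and StandardScaler needs the feature assembled into a vector column first):

from pyspark.ml.feature import VectorAssembler, StandardScaler

# StandardScaler operates on a vector column, so assemble 'feature' into one.
assembler = VectorAssembler(inputCols=['feature'], outputCol='feature_vec')
assembled = assembler.transform(df4)

# Fit and transform over the whole dataframe (no per-window statistics).
scaler = StandardScaler(inputCol='feature_vec', outputCol='feature_scaled',
                        withMean=True, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)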
I could pivot the windowed/grouped data into separate columns, put that into a dataframe, apply StandardScaler to it, and then reshape the result back to 1D. Is there any other method? The ultimate goal is to try different scalers, including pyspark.ml.feature.RobustScaler. (For clarity, a rough sketch of the hand-rolled window-function equivalent is below.)
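The hand-rolled equivalent with window functions would look roughly like this (a sketch only; partitioning by 'txt' is just an example of the real grouping, and this does not reuse the ml scalers):

from pyspark.sql import Window, functions as F

# Example grouping column; the real window spec would go here.
w = Window.partitionBy('txt')

# Standardise 'feature' within each window using plain window aggregates.
df4_scaled = df4.withColumn(
    'feature_scaled',
    (F.col('feature') - F.mean('feature').over(w)) / F.stddev('feature').over(w)
)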