
Essentially, I have a dataset with thousands of distinct machines (each with a unique ID) and variables measuring their operation on a daily basis, as in:

ID | Var1 | Var2
A  |  99  |  51
A  |  76  |  49
B  |  40  |   8
B  |  33  |  10

My objective is to use the pyspark.ml package to perform standardisation followed by a PCA that reduces the features to a single component, so I can monitor the resulting variable and determine whether a machine is operating normally or not, roughly as in the sketch below.
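For reference, this is the kind of pipeline I have in mind when fitting on the whole dataset at once (column names as in the sample above; the CSV file name is just a placeholder for my actual data source):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA

spark = SparkSession.builder.getOrCreate()
# placeholder source; in reality the data comes from my own table
df = spark.read.csv("machines.csv", header=True, inferSchema=True)

assembler = VectorAssembler(inputCols=["Var1", "Var2"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled",
                        withMean=True, withStd=True)
pca = PCA(k=1, inputCol="scaled", outputCol="pca")

pipeline = Pipeline(stages=[assembler, scaler, pca])
model = pipeline.fit(df)        # fits the scaler and PCA over ALL machines together
result = model.transform(df)    # adds the single "pca" feature column
```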

The issue, however, is that these machines are not all exposed to the same working conditions, so I would need to fit the methods to each machine individually in order to learn what "normal" means for that specific asset.

I have attempted to do this by breaking my dataset into multiple datasets (one per ID) and running the approach in a for loop, as sketched below, but because the number of distinct IDs is in the thousands the execution time is not ideal.
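This is roughly what that loop looks like (reusing the `pipeline` and `df` from the sketch above); it works, but it is far too slow at this scale:

```python
from functools import reduce

ids = [row["ID"] for row in df.select("ID").distinct().collect()]

models = {}
per_machine = []
for machine_id in ids:                      # thousands of iterations
    sub = df.filter(df["ID"] == machine_id)
    fitted = pipeline.fit(sub)              # separate scaler + PCA fit per machine
    models[machine_id] = fitted
    per_machine.append(fitted.transform(sub))

# stitch the per-machine results back into one DataFrame
scored = reduce(lambda a, b: a.union(b), per_machine)
```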

Without much success, I have also been trying to use PySpark window functions to fit these methods over a window that partitions by ID, but I can't get that to work because the fit method takes a DataFrame as input, and I keep getting the error: AttributeError: 'DataFrame' object has no attribute 'over'.
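For completeness, the failing attempt was along these lines; I realise a DataFrame has no .over() method, which is exactly what the traceback says, but I could not find the right way to combine a window specification with fit():

```python
from pyspark.sql import Window

w = Window.partitionBy("ID")
# raises: AttributeError: 'DataFrame' object has no attribute 'over'
model = pipeline.fit(df.over(w))
```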

I was wondering if anyone knows whether it is in fact possible to utilise pyspark.ml to fit models per window partition (i.e. one standardisation and PCA per ID), and could perhaps provide a code example of how to do that?
