I want to use `StandardScaler` to scale the data. I've loaded the data into a PythonRDD, and it seems the data is sparse. To apply `StandardScaler` (at least with `withMean=True`), the sparse vectors should first be converted into dense ones.
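As far as I understand, that conversion would look roughly like the sketch below. This is just a guess on my part: I'm assuming `Vectors.dense` together with `toArray()` is the right way to densify, and `features` is only a placeholder name for an RDD of sparse vectors.

from pyspark.mllib.linalg import Vectors

# `features` is a placeholder for an RDD of SparseVector (my assumption)
denseFeatures = features.map(lambda v: Vectors.dense(v.toArray()))

Here is my current code: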
from pyspark.mllib.util import MLUtils
from pyspark.mllib.feature import StandardScaler

trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
valData = MLUtils.loadLibSVMFile(sc, valDataPath)
trainLabel = trainData.map(lambda x: x.label)
trainFeatures = trainData.map(lambda x: x.features)
valLabel = valData.map(lambda x: x.label)
valFeatures = valData.map(lambda x: x.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(trainFeatures)
# apply the scaler to the data. Here, trainFeatures is a sparse PythonRDD, so we first need to convert it to a dense type
trainFeatures_scaled = scaler.transform(trainFeatures)
valFeatures_scaled = scaler.transform(valFeatures)
# merge `trainLabel` and `trainFeatures_scaled` into a new PythonRDD
trainData1 = ...
valData1 = ...
# use the scaled data, i.e., trainData1 and valData1, to train a model
...
The above code has errors. I have two questions:

- How do I convert the sparse PythonRDD `trainFeatures` into a dense type that can be used as the input of `StandardScaler`?
- How do I merge `trainLabel` and `trainFeatures_scaled` into a new RDD of `LabeledPoint` that can be used to train a classifier (e.g. random forest)?

I still can't find any documents or references about this.
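For the second question, my best guess is to zip the label RDD with the scaled feature RDD and rebuild `LabeledPoint` objects, roughly as sketched below. Again, this is only a sketch of what I have in mind; I'm not sure whether `zip` is safe or idiomatic here.

from pyspark.mllib.regression import LabeledPoint

# trainLabel and trainFeatures_scaled both come from trainData via 1-to-1 maps,
# so I assume their partitions and element counts line up for zip
trainData1 = trainLabel.zip(trainFeatures_scaled) \
    .map(lambda pair: LabeledPoint(pair[0], pair[1]))
valData1 = valLabel.zip(valFeatures_scaled) \
    .map(lambda pair: LabeledPoint(pair[0], pair[1]))

Does this look like the right approach, or is there a more standard way to do it?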