
I want to use StandardScaler to scale the data. I've loaded the data into a PythonRDD, and it seems the data is sparse. To apply StandardScaler, we should first convert it into a dense type.

from pyspark.mllib.util import MLUtils
from pyspark.mllib.feature import StandardScaler

trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
valData = MLUtils.loadLibSVMFile(sc, valDataPath) 
trainLabel = trainData.map(lambda x: x.label)
trainFeatures = trainData.map(lambda x: x.features)
valLabel = valData.map(lambda x: x.label)
valFeatures = valData.map(lambda x: x.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(trainFeatures)

# apply the scaler to the data. Here, trainFeatures is a sparse PythonRDD; we first convert it into a dense type
trainFeatures_scaled = scaler.transform(trainFeatures)
valFeatures_scaled = scaler.transform(valFeatures)    

# merge `trainLabel` and `trainFeatures_scaled` into a new PythonRDD
trainData1 = ...
valData1 = ...

# use the scaled data, i.e., trainData1 and valData1, to train a model
...

The above code has errors. I have two questions:

  1. How can I convert the sparse PythonRDD trainFeatures into a dense type that can be used as input to StandardScaler?
  2. How can I merge trainLabel and trainFeatures_scaled into a new LabeledPoint RDD that can be used to train a classifier (e.g. a random forest)?

I haven't found any documents or references about this yet.

  • Depending on the number of features, converting the data to dense can be a really bad idea. – zero323 May 21 '16 at 04:45
  • @zero323, thanks for your suggestion! But if we don't do that, how can we scale the sparse data loaded from the libSVM file? – mining May 21 '16 at 04:46
  • @zero323, in fact, our original data is dense; I saved it in libSVM format so I could use `MLUtils.loadLibSVMFile` to load it. I think it might be more reasonable to save it in a format that is compatible with `DataFrame`. – mining May 21 '16 at 04:50
  • If the original data is dense then using the libSVM format can roughly double the size of the output, which may not be the best idea either. But my only point is that you should be careful when making data dense. In the worst-case scenario these vectors can get quite large. – zero323 May 21 '16 at 04:57
  • @zero323, yes, I should save the data in another format; maybe that would be more convenient to handle. Taking your valuable suggestion, I'm trying to use a plain map-reduce to compute the means and standard deviations for each feature dimension instead of using StandardScaler. – mining May 21 '16 at 05:03
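
Following up on that last comment, here is a minimal sketch of computing per-feature means and standard deviations without StandardScaler, using MLlib's `Statistics.colStats`. It assumes `trainFeatures` is the sparse feature RDD from the question; the variable names are illustrative.

from pyspark.mllib.stat import Statistics

# Column-wise summary statistics, computed in a single pass over the RDD
summary = Statistics.colStats(trainFeatures)
means = summary.mean()              # per-feature means (numpy array)
stds = summary.variance() ** 0.5    # per-feature standard deviations

If centering isn't actually required, `StandardScaler(withMean=False)` should also work directly on the sparse vectors, since it only divides by the standard deviation and never subtracts the mean.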

1 Answer


To convert to dense, map using `toArray`:

from pyspark.mllib.linalg import DenseVector

dense = valFeatures.map(lambda v: DenseVector(v.toArray()))

To merge, zip the labels with the scaled features and build LabeledPoints:

from pyspark.mllib.regression import LabeledPoint

valLabel.zip(dense).map(lambda lf: LabeledPoint(lf[0], lf[1]))
– 77a299fa
  • Thanks! It worked! Could you please tell me where I could learn more about this? I really couldn't find it in the documentation. – mining May 21 '16 at 04:48
  • By the way, in `Scala` you can use the [SparseVector.toDense](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector) method. – Alberto Bonsanto May 21 '16 at 11:47
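
For completeness, here is a rough end-to-end sketch that combines the densify/scale/merge steps from the answer with a random-forest classifier, along the lines the question asks about. The helper `scale_and_label` and the `numClasses`/`numTrees` values are placeholders, not taken from the original post; `scaler`, `trainFeatures`, `trainLabel`, `valFeatures`, and `valLabel` are the objects defined in the question.

from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# densify, scale, and re-attach labels (same pattern as in the answer)
def scale_and_label(features, labels):
    dense = features.map(lambda v: DenseVector(v.toArray()))
    scaled = scaler.transform(dense)
    return labels.zip(scaled).map(lambda lf: LabeledPoint(lf[0], lf[1]))

trainData1 = scale_and_label(trainFeatures, trainLabel)
valData1 = scale_and_label(valFeatures, valLabel)

# train a random forest on the scaled training data (placeholder settings)
model = RandomForest.trainClassifier(
    trainData1, numClasses=2, categoricalFeaturesInfo={}, numTrees=10)

# predict on the scaled validation features
predictions = model.predict(valData1.map(lambda p: p.features))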