The data is stored in the following files:
data/file1_features.mat
data/file1_labels.txt
data/file2_features.mat
data/file2_labels.txt
...
data/file100_features.mat
data/file100_labels.txt
Each data/file*_features.mat stores the features of some samples, one sample per row. Each data/file*_labels.txt stores the labels of those samples, one number per row (e.g., 1, 2, 3, ...). Across all 100 files there are about 80 million samples in total.
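For reference, this is roughly how I read a single file pair locally with NumPy/SciPy (just a sketch; the variable name 'features' inside the .mat file is an assumption for illustration):
import numpy as np
from scipy.io import loadmat

# Load one features file and its matching labels file (local, non-Spark).
features = loadmat('data/file1_features.mat')['features']  # rows are samples
labels = np.loadtxt('data/file1_labels.txt')               # one label per row
assert features.shape[0] == labels.shape[0]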
How can I access this data set in Spark?
I have checked spark-2.0.0-preview/examples/src/main/python/mllib/random_forest_classification_example.py, which has the following lines:
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
(trainingData, testData) = data.randomSplit([0.7, 0.3])
When I run this example in ./bin/pyspark, it shows that the data object is a PythonRDD:
PythonRDD[32] at RDD at PythonRDD.scala:48
The data/mllib/sample_libsvm_data.txt is just one file, whereas in my case there are many files. Is there an RDD in Spark that handles this case conveniently, or do I need to merge all 100 files into one big file and process it as in the example? I want to use the Spark engine to scale the data set (mean-std normalization or min-max normalization).
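To make the goal concrete, this is roughly the scaling step I have in mind once the data is loaded, sketched with pyspark.mllib.feature.StandardScaler; features_rdd is a hypothetical RDD of feature vectors built from the 100 files, which is exactly the loading step I am asking about:
from pyspark.mllib.feature import StandardScaler

# features_rdd is assumed to already be an RDD of feature vectors (hypothetical).
scaler = StandardScaler(withMean=True, withStd=True)  # mean-std normalization
model = scaler.fit(features_rdd)                      # computes per-column mean and std
scaled_rdd = model.transform(features_rdd)            # applies the scaling to every sample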