
The data is stored in the following files:

    data/file1_features.mat
    data/file1_labels.txt
    data/file2_features.mat
    data/file2_labels.txt
    ...
    data/file100_features.mat
    data/file100_labels.txt

Each data/file*_features.mat stores the features of some samples, with one sample per row. Each data/file*_labels.txt stores the labels of those samples, with one label per row (e.g., 1, 2, 3, ...). Across all 100 files, there are about 80 million samples in total.

How can I access this data set in Spark?

I have checked spark-2.0.0-preview/examples/src/main/python/mllib/random_forest_classification_example.py, which contains the following lines:

    data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

When I run this example in ./bin/pyspark, it shows that the data object is a PythonRDD:

    PythonRDD[32] at RDD at PythonRDD.scala:48

The data/mllib/sample_libsvm_data.txt is just one file, whereas in my case there are many files. Is there any RDD in Spark that handles this case conveniently? Do I need to merge all 100 files into one big file and process it as in the example? I want to use the Spark engine to scale the data set (mean-std normalization or min-max normalization).

mining
  • I see there are two types of files, one with the .mat extension and another with the .txt extension... Do you want to load all files into a single RDD for processing, or load only the txt/mat files? – Shashi May 18 '16 at 19:57
  • @Shashi, yes, I want to load both types of data. The `*.mat` files are the features and the `*.txt` files are the labels. If I understand correctly, the data has already been sharded. So I wonder if we should write a simple interface for the `*.mat` files (e.g. using h5py) to load them into numpy arrays and then feed them into a Spark RDD (see the sketch after these comments). Then we can use the RDD in pyspark. – mining May 18 '16 at 23:05
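
A minimal sketch of that idea, not from the original thread: it assumes the `*.mat` files predate MATLAB's v7.3 format (so scipy.io.loadmat can read them; v7.3 files are HDF5 and need h5py instead), that scipy is installed on every worker, that every worker can reach the data/ paths (e.g. shared storage), and that the matrix inside each file is stored under a hypothetical variable name `features`:

    import scipy.io

    def rows_of(path):
        # 'features' is a hypothetical variable name inside each .mat file.
        return scipy.io.loadmat(path)['features'].tolist()

    paths = ['data/file%d_features.mat' % i for i in range(1, 101)]
    # Distribute the file list so that each worker reads its own files,
    # instead of funneling 80 million samples through the driver.
    features = sc.parallelize(paths, 100).flatMap(rows_of)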

2 Answers


Simply point sc.textFile at the data directory:

    dir = "<path_to_data>/data"
    rdd = sc.textFile(dir)

Spark automatically picks up all of the files inside that directory.
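
One caveat worth adding here, not from the original answer: sc.textFile treats every matched file as line-oriented text, so the binary .mat files sitting in the same directory would come back as garbage; the comments below discuss converting or separating them first. A minimal usage sketch, assuming the label files have been moved into a hypothetical labels/ subdirectory so that only text files match:

    # One RDD element per line, across every file in the directory.
    # 'labels/' is a hypothetical subdirectory holding only *_labels.txt files.
    labels = sc.textFile("<path_to_data>/data/labels").map(lambda line: int(line.strip()))
    print(labels.count())  # total number of labels across all files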

WestCoastProjects
  • Thank you! I also want to load the `*.mat` files. Maybe I should store the features in `*.txt` format. Your solution is a good start. – mining May 18 '16 at 23:07
  • Does Spark already support reading MATLAB files? I think we should first convert the MATLAB matrices into `*.txt` format. – mining May 18 '16 at 23:40
  • No - that is not what I meant. The semantics of the files are up to you to provide. My point was only that all of the files - regardless of extension - would get sucked in. You will need to ensure the format of each file is compatible with its intended usage by the Spark worker app/code. – WestCoastProjects May 18 '16 at 23:54
  • I think I didn't catch your idea completely. I'm not familiar with Spark; I should get some basic knowledge about RDDs first. After searching, I found this post [http://blog.madhukaraphatak.com/matfile-to-rdd/] and this post [http://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd], which might be related to this question. If I understand correctly, we should first read the batched data into memory and then merge it into a large RDD. But that doesn't seem right for an RDD. With my limited knowledge of Spark, I think an RDD might be just some file keys in a map. – mining May 19 '16 at 00:08
  • You do not need to worry about merging: Spark takes care of it. You should, however, pre-process the `.mat` files to get them into a proper structure *before* the RDD operation. That assumes there are only a few of them. If there are a lot, you might consider a separate Spark step that pre-processes just the .mat files: but in that case you would need to keep the .mat files in a separate directory from the .txt files. – WestCoastProjects May 19 '16 at 00:11
  • Thanks! I'll try that. For now I've decided to first convert the features (i.e., the `*.mat` files) into `*.txt` format (a conversion sketch follows these comments). As there are multiple different features, I'll save them in the following layout: `data/features/feature1/file1.txt`, `data/features/feature2/file1.txt`, ..., `data/features/feature1/file100.txt`. For the labels, I'll save them as `data/labels/file1.txt`, `data/labels/file2.txt`, ..., `data/labels/file100.txt`. Then I'll try `sc.textFile('data/features/')` to load them into an RDD. – mining May 19 '16 at 00:33
  • But this will cost a lot of disk space to save those features because of the `*.txt` file format. – mining May 19 '16 at 00:35
  • @mining Spark is not MATLAB - so we need to make concessions. In any case, disk space is unlikely to be the constrained resource here: CPU and/or RAM are more likely, with networking the third likely liability. – WestCoastProjects May 19 '16 at 00:38
  • Yes, you're right. Thanks very much for your comments and suggestions! – mining May 19 '16 at 00:42
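
A minimal offline conversion sketch for the plan discussed in the comments above, not from the original thread. It assumes scipy.io.loadmat can read the files (pre-v7.3 format) and that each matrix is stored under a hypothetical key `features`; it writes one flat per-file layout rather than the per-feature directories mentioned in the comments:

    import os
    import numpy as np
    import scipy.io

    if not os.path.isdir('data/features'):
        os.makedirs('data/features')

    for i in range(1, 101):
        mat = scipy.io.loadmat('data/file%d_features.mat' % i)
        # One space-separated row per sample, readable later by sc.textFile.
        np.savetxt('data/features/file%d.txt' % i, mat['features'])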

If you want to load only a specific file type, you can use a wildcard (glob) pattern when loading files into an RDD.

    dir = "data/*.txt"
    sc.textFile(dir)

Spark will load all files ending with the .txt extension.
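
Building on this, not from the original answer: the same pattern selects just the label files from the layout in the question, and once the features are converted to text, the mean-std scaling the question asks about can be done with mllib's StandardScaler. A sketch, assuming space-separated feature rows under a hypothetical data/features/ directory:

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors

    # Glob patterns select by name, e.g. only the label files:
    labels = sc.textFile("data/*_labels.txt").map(lambda line: int(line.strip()))

    # Space-separated rows (e.g. written by numpy.savetxt) to dense vectors:
    features = sc.textFile("data/features/*.txt") \
                 .map(lambda line: Vectors.dense([float(x) for x in line.split()]))

    # Mean-std normalization; withMean=True requires dense vectors.
    scaler = StandardScaler(withMean=True, withStd=True).fit(features)
    scaled = scaler.transform(features)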

Shashi
  • Thank you! I also want to load the `*.mat` files. Maybe I should store the features in `*.txt` format. Your solution is a good start. – mining May 18 '16 at 23:08