I'm trying to load a CSV file as a `JavaRDD<String>` and then convert the data into a `JavaRDD<Vector>`:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
import org.apache.spark.mllib.stat.Statistics;

import breeze.collection.mutable.SparseArray;
import scala.collection.immutable.Seq;

import java.util.List;


public class Trial {

    public void start() throws InstantiationException, IllegalAccessException,
            ClassNotFoundException {
        run();
    }

    private void run() {
        SparkConf conf = new SparkConf().setAppName("csvparser");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
        JavaRDD<Vector> datamain = data.flatMap(null);
        MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());

        System.out.println(mat.mean());
    }

    private List<Vector> Seq(Vector dv) {
        // TODO Auto-generated method stub
        return null;
    }


    public static void main(String[] args) throws Exception {

        Trial trial = new Trial();
        trial.start();
    }
}

The program runs without any error, but I don't get any output when running it on the Spark machine. Can anyone tell me whether the conversion of the String RDD to a Vector RDD is correct?

My CSV file consists of only one column, which contains floating-point numbers.

Anshul Kalra
3 Answers


The null in this flatMap invocation might be a problem:

JavaRDD<Vector> datamain = data.flatMap(null);

Marek Dudek
  • What should be the solution? – Anshul Kalra Feb 03 '16 at 19:46
  • Hard to tell, because I don't know what you're trying to do with it. I know nothing about `MultivariateStatisticalSummary`. Try running it within a unit test; it's easy, and you'll get an error message. You'll certainly need to provide some function to `flatMap`, but what kind I don't know. – Marek Dudek Feb 03 '16 at 19:49
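To illustrate the comment above, here is a sketch (not from the answers) of the kind of function `flatMap` expects in place of `null`. It assumes Spark 1.x, where `FlatMapFunction.call` returns an `Iterable`, and a one-column CSV of floating-point numbers; it also needs `org.apache.spark.api.java.function.FlatMapFunction` and `java.util.Collections` in addition to the question's imports.

```java
// Each input line becomes zero or more Vectors: blank lines are
// dropped, every other line is parsed into a one-element dense Vector.
JavaRDD<Vector> datamain = data.flatMap(new FlatMapFunction<String, Vector>() {
    @Override
    public Iterable<Vector> call(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();              // skip blank lines
        }
        return Collections.singletonList(
                Vectors.dense(Double.parseDouble(trimmed)));
    }
});
```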

I solved it by changing the code to this:

JavaRDD<Vector> datamain = data.map(new Function<String, Vector>() {
    public Vector call(String s) {
        String[] sarray = s.trim().split("\\r?\\n");
        double[] values = new double[sarray.length];
        for (int i = 0; i < sarray.length; i++) {
            values[i] = Double.parseDouble(sarray[i]);
            System.out.println(values[i]);
        }
        return Vectors.dense(values);
    }
});
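Worth noting why this works for a one-column file: `textFile` already hands `map` one line at a time, so splitting a single line on a newline pattern yields a one-element array, and each resulting `Vector` is one-dimensional. A plain-Java sketch of that behavior (not Spark code):

```java
// A line from textFile contains no newline, so split("\\r?\\n")
// returns a single element; each Vector is therefore one-dimensional,
// matching the one-column input that colStats summarizes.
String line = "2.0";
String[] parts = line.trim().split("\\r?\\n");
System.out.println(parts.length);                 // 1
System.out.println(Double.parseDouble(parts[0])); // 2.0
```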
Anshul Kalra

Assuming your trial.csv file looks like this

1.0
2.0
3.0

Taking the original code from your question, only a one-line change is required with Java 8:

SparkConf conf = new SparkConf().setAppName("csvparser").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
JavaRDD<Vector> datamain = data.map(s -> Vectors.dense(Double.parseDouble(s)));
MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());

System.out.println(mat.mean());

Prints 2.0
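Once `colStats` succeeds, the same summary object can report more than the mean. A sketch based on the Spark 1.x `MultivariateStatisticalSummary` API, with values assuming the three-line file above:

```java
// Per-column statistics over the 1.0/2.0/3.0 input; mean/variance/
// min/max each return a Vector with one entry per column.
System.out.println(mat.mean());     // column mean: 2.0
System.out.println(mat.variance()); // sample variance: 1.0
System.out.println(mat.min());      // minimum: 1.0
System.out.println(mat.max());      // maximum: 3.0
System.out.println(mat.count());    // number of rows: 3
```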

Brad