
I am doing a PoC on Spark's MapReduce performance for calculating a weighted average over 5,000 to 200,000 records, and it appears to be very slow, so I wanted to check whether I am doing something wrong. Here are my setup details:
  • No. of worker nodes: 2
  • CPUs: 8 per node (16 total)

For 5,000 orders, it takes around 9 seconds to run all of the following MapReduce operations, which calculate the weighted average, i.e. (n1*v1 + n2*v2 + ...)/(n1 + n2 + ...):

    //Calculation of sum of n*v using Map Reduce
    JavaPairRDD<String, Double> jprMap = javaRDD.mapToPair(new PairFunction<Tuple2<Double, Double>, String, Double>() {
        public Tuple2<String, Double> call(Tuple2<Double, Double> t) { return new Tuple2<String, Double>("Numerator", t._1*t._2); }
    }); 

    JavaPairRDD<String, Double> num = jprMap.reduceByKey(new Function2<Double, Double, Double>(){
        public Double call(Double v1, Double v2) { return v1 + v2; }
    });

    Double numValue = num.values().first();

    // Calculate sum of n using MapReduce
    JavaPairRDD<String, Double> jprMapSum = javaRDD.mapToPair(new PairFunction<Tuple2<Double, Double>, String, Double>() {
        public Tuple2<String, Double> call(Tuple2<Double, Double> t) { return new Tuple2<String, Double>("denominator", t._1); }
    }); 

    JavaPairRDD<String, Double> den = jprMapSum.reduceByKey(new Function2<Double, Double, Double>(){
        public Double call(Double v1, Double v2) { return v1 + v2; }
    });

    Double denValue = den.values().first(); 

    Double weightedAverage = numValue/denValue;

For 200,000 records it also takes around 9 seconds. Is this expected behavior? It seems very slow. Does Spark work well only for big data (billions of records)? Is there a way to improve the performance of this kind of calculation?
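Since the runtime is about the same for 5,000 and 200,000 records, fixed per-job overhead may be dominating rather than the arithmetic, and the code above launches two separate jobs over the same data. One direction worth trying is to accumulate both sums in a single pass. The sketch below shows that shape in plain Java with no Spark dependency, so it runs on its own; the class `WeightedAverageSketch` and the helper `Sums` are hypothetical names introduced for illustration and are not part of the original code.

```java
import java.util.Arrays;
import java.util.List;

public class WeightedAverageSketch {

    /** Immutable running pair of sums: sum(n*v) and sum(n). Hypothetical helper. */
    static final class Sums {
        final double weighted; // sum of n*v seen so far
        final double weight;   // sum of n seen so far

        Sums(double weighted, double weight) {
            this.weighted = weighted;
            this.weight = weight;
        }

        /** Fold one (n, v) record into the running sums. */
        Sums add(double n, double v) {
            return new Sums(weighted + n * v, weight + n);
        }

        /** Combine two partial results, e.g. from two partitions. */
        Sums merge(Sums other) {
            return new Sums(weighted + other.weighted, weight + other.weight);
        }

        double average() {
            return weighted / weight;
        }
    }

    /** Each element of pairs is {n, v}; both sums are computed in one pass. */
    static double weightedAverage(List<double[]> pairs) {
        Sums total = pairs.stream()
                .reduce(new Sums(0.0, 0.0),
                        (acc, p) -> acc.add(p[0], p[1]),
                        Sums::merge);
        return total.average();
    }

    public static void main(String[] args) {
        List<double[]> data = Arrays.asList(
                new double[] {2.0, 10.0},
                new double[] {3.0, 20.0});
        // (2*10 + 3*20) / (2 + 3) = 80 / 5
        System.out.println(weightedAverage(data)); // prints 16.0
    }
}
```

In Spark, the same fold-and-merge pair could be passed to `JavaRDD.aggregate(zeroValue, seqOp, combOp)`, which would replace the two `mapToPair`/`reduceByKey` jobs with one action and no keyed shuffle (the accumulator type would need to be serializable). Whether that brings the runtime anywhere near milliseconds on this cluster is untested here; it only removes one of the two passes.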

Pooja Mazumdar
  • Try to cache the RDDs that you use multiple times in memory (or on disk if they're too big); otherwise Spark will recalculate them every time. – drstein Aug 09 '16 at 11:41
  • Even after caching, it takes 6 seconds. Is there a way to do these computations in milliseconds? – Pooja Mazumdar Aug 09 '16 at 13:40
