I am pretty new to Spark, and I am using a cluster mainly for parallelization purposes. I have a 100 MB file, each line of which is processed by some algorithm; the processing is quite heavy and time-consuming. I want to use a 10-node cluster and parallelize the processing. I know the block size is larger than 100 MB, so by default the file is read into very few partitions, which is why I tried to repartition the textFile. If I understand correctly, this repartition method increases the number of partitions:
JavaRDD<String> input = sc.textFile(args[0]);
input.repartition(10);
The issue is that when I deploy the job to the cluster, only a single node is effectively doing the processing. How can I get the file processed in parallel?
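For what it's worth, a quick way to check how many partitions the RDD actually has would be something like the following (a minimal sketch; getNumPartitions() is available on JavaRDD in recent Spark versions, older versions expose the same information via partitions().size()):

// Sketch: print the partition count before and after repartitioning (the numbers are illustrative).
System.out.println("partitions before: " + input.getNumPartitions());
JavaRDD<String> repartitioned = input.repartition(10);
System.out.println("partitions after: " + repartitioned.getNumPartitions());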
Update 1: here is my spark-submit command:
/usr/bin/spark-submit --master yarn --class mypackage.myclass --jars myjar.jar gs://mybucket/input.txt outfile
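For reference, parallelism-related settings can also be supplied programmatically when the context is created; below is a minimal sketch with placeholder values (spark.default.parallelism is a standard Spark configuration key, and the app name is just an example):

// imports assumed: org.apache.spark.SparkConf, org.apache.spark.api.java.JavaSparkContext
SparkConf conf = new SparkConf()
        .setAppName("myapp")                     // placeholder application name
        .set("spark.default.parallelism", "10"); // placeholder value, not what I currently use
JavaSparkContext sc = new JavaSparkContext(conf);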
Update 2: After the repartitioning, there are basically two operations:
JavaPairRDD<String, String> int_input = mappingToPair(input);
JavaPairRDD<String, String> output = mappingValues(int_input, option);
output.saveAsTextFile("hdfs://...");
where mappingToPair(...) is:
public JavaPairRDD<String, String> mappingToPair(JavaRDD<String> input) {
    return input.mapToPair(new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String line) {
            // split each line into a key and the rest of the record
            String[] arrayList = line.split("\t", 2);
            return new Tuple2<String, String>(arrayList[0], arrayList[1]);
        }
    });
}
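For example, assuming a tab-separated input line (the sample record below is hypothetical, not from my real data), the split produces the key/value pair like this:

// Hypothetical sample record: key and payload separated by a tab.
String line = "id42\tsome long payload to process";
String[] arrayList = line.split("\t", 2);
// arrayList[0] -> "id42"                          (becomes the key)
// arrayList[1] -> "some long payload to process"  (becomes the value)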
and mappingValues(...) is a method of the following type:
public JavaPairRDD<String, String> mappingValues(JavaPairRDD<String, String> rdd, final String option) {
    return rdd.mapValues(new Function<String, String>() {
        public String call(String value) throws Exception {
            // here the heavy algo processing takes place on each value...
            return value; // placeholder: the real algorithm's result is returned here
        }
    });
}