
I want to run a Spark word-count application on four different files at the same time.

I have a standalone cluster with 4 worker nodes, each node having one core and 1 GB of memory.

Spark runs in standalone mode with:

1. 4 worker nodes
2. 1 core per worker node
3. 1 GB of memory per node
4. spark.cores.max set to 1

./conf/spark-env.sh


export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=1"
export SPARK_WORKER_OPTS="-Dspark.deploy.defaultCores=1"
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=4


I submitted the applications from a .sh file:

./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.R  txt1 &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.R  txt2 &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.R  txt3 &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.R  txt4

Is this the correct way to submit applications in parallel?

When one application runs alone it takes about 2 seconds (using only one core), but when all 4 applications are submitted simultaneously, each one takes more than 4 seconds... How do I run Spark applications on different files in parallel?

1 Answer


When you submit multiple jobs to a Spark cluster, the cluster manager (the standalone Master, or YARN's ResourceManager / ApplicationMaster when Spark runs on YARN) schedules the jobs in parallel, as long as enough resources are available.

You don't need to do any extra scheduling for that.

And for the scenario you have shown, you could have read all the different files in a single Spark job.

And believe me, thanks to Spark's lazy evaluation, the DAG optimization of the logical/physical plans, and RDD transformations, reading the different files and counting their words will proceed in parallel.

You can read all the files in a single job as:

sc.wholeTextFiles("<folder-path>")

Here, <folder-path> is the parent directory where all the files reside.
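For reference, here is a minimal sketch of that idea using the Scala API; the MultiFileWordCount name and the /data/input and /data/output paths are placeholders, not anything from the question:

import org.apache.spark.{SparkConf, SparkContext}

object MultiFileWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MultiFileWordCount"))

    // wholeTextFiles yields one (filePath, fileContent) pair per file
    val counts = sc.wholeTextFiles("/data/input")   // placeholder input directory
      .flatMap { case (path, content) =>
        // key every word by its source file, so the counts stay per file
        content.split("\\s+").map(word => ((path, word), 1))
      }
      .reduceByKey(_ + _)

    counts.saveAsTextFile("/data/output")           // placeholder output directory
    sc.stop()
  }
}

This way a single application can use all four cores at once, instead of four separate applications competing for them.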

Raktotpal Bordoloi