
I have Spark set up in standalone mode on a single node with 2 cores and 16GB of RAM to run some rough POCs.

I want to load data from a SQL source using val df = spark.read.format("jdbc")...option("numPartitions", n).load(). When I tried to measure the time taken to read a table for different numPartitions values by calling df.rdd.count, I saw that the time was the same regardless of the value I gave. I also noticed on the Spark web UI that the number of active executors was 1, even though I set SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_CORES=1 in my spark-env.sh file.
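Roughly what I am running (a sketch; the URL, table and credentials below are placeholders, not my real values):

val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://<host>:3306/<db>")  // placeholder connection string
  .option("dbtable", "<table>")
  .option("user", "<user>")
  .option("password", "<password>")
  .option("numPartitions", n)
  .load()

df.rdd.count()  // action I am using to time the read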

I have 2 questions:
Do the numPartitions actually created depend on the number of executors?
How do I start spark-shell with multiple executors in my current setup?

Thanks!

Priyank

1 Answer


The number of partitions doesn't depend on your number of executors. There is a best practice of keeping a few partitions per core, but the partition count is not determined by the number of executor instances.
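You can check this from the shell yourself; getNumPartitions reports the partition count of the DataFrame regardless of how many executors are registered:

// df is the DataFrame you loaded; this prints its partition count,
// which is the same whether 1 or many executors are running.
println(df.rdd.getNumPartitions)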

When reading from JDBC, to parallelize the read you need to specify a partition column, e.g.:

spark.read("jdbc")
  .option("url", url)
  .option("dbtable", "table")
  .option("user", user)
  .option("password", password)
  .option("numPartitions", numPartitions)
  .option("partitionColumn", "<partition_column>")
  .option("lowerBound", 1)
  .option("upperBound", 10000)
  .load()

That will parallelize the read into numPartitions queries against the database, each covering a range of roughly (10000 - 1) / numPartitions values of the partition column.
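Roughly how the bounds are split (a plain-Scala sketch of the idea, not Spark's actual source; the exact predicates Spark generates can differ slightly, and numPartitions = 4 here is just for illustration):

// Illustration only: how lowerBound/upperBound/numPartitions turn into
// per-partition WHERE ranges on the partition column.
val lowerBound = 1L
val upperBound = 10000L
val numPartitions = 4
val stride = (upperBound - lowerBound) / numPartitions

val ranges = (0 until numPartitions).map { i =>
  val start = lowerBound + i * stride
  val end   = start + stride
  if (i == numPartitions - 1) s"<partition_column> >= $start"  // last partition takes the tail
  else s"<partition_column> >= $start AND <partition_column> < $end"
}
ranges.foreach(println)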

About your second question, you can find all of the Spark configuration options here: https://spark.apache.org/docs/latest/configuration.html (for example spark2-shell --num-executors, or the equivalent setting --conf spark.executor.instances).

Be aware that explicitly specifying the number of executors means dynamic allocation will be turned off.
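In standalone mode specifically, the number of executors you end up with is usually driven by spark.executor.cores together with spark.cores.max (or --total-executor-cores) rather than by --num-executors. Whichever way you launch the shell, you can check how many executors actually registered from inside it; a minimal sketch, assuming a running spark-shell session:

// Quick check from inside spark-shell: how many block managers are registered.
// The returned map typically includes the driver as well, hence the "- 1".
val executorCount = spark.sparkContext.getExecutorMemoryStatus.size - 1
println(s"registered executors: $executorCount")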

ShemTov