I am trying to understand the difference in speed between my spark-submit and spark-shell jobs. I start the shell and the submitted job with the same resource allocations, but I get very different performance: the job takes ~10 minutes in the shell vs. over an hour with spark-submit. My question is: is the number of tasks shown in the progress bar of the REPL comparable to the number of executors reported when running with spark-submit? I see very different numbers for each and I wonder if I am doing something wrong.
In the shell I start it with:
--executor-cores 5 \
--executor-memory 16g \
--driver-memory 230g \
--conf "spark.driver.maxResultSize=100g" \
--conf "spark.network.timeout=360s
With that allocation I see 950 concurrent tasks in the progress bar:
... pandas_df = intent_dict_rdd.map(lambda x: Row(**x)).toDF().toPandas()
[Stage 1:==============================> (19503 + 950) / 31641]
When I do spark-submit with the same resource allocation, I only see 189 executors:
18/07/19 23:44:25 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180719234425-0001/189 on worker-20180719233757-10.0.108.198-33953 (10.0.108.198:33953) with 5 cores
18/07/19 23:44:25 INFO StandaloneSchedulerBackend: Granted executor ID app-20180719234425-0001/189 on hostPort 10.0.108.198:33953 with 5 cores, 16.0 GB RAM
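To compare like with like, I can also count the executors that actually registered from inside the running application, rather than eyeballing the logs. A rough sketch (it goes through the private `_jsc` handle to the underlying Scala SparkContext, so treat the exact call as an assumption; `sc` is the SparkContext, i.e. `spark.sparkContext`):

# getExecutorMemoryStatus returns one entry per registered block manager,
# including the driver, so the executor count is roughly size - 1.
status = sc._jsc.sc().getExecutorMemoryStatus()
print("registered block managers:", status.size())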
I am using 10x m5.24xlarge machines, which is 96 cores and 384 GB of RAM each, so 960 cores in total. That total looks a lot more like the number of concurrent tasks I see in the shell, while the executor count from spark-submit looks a lot more like 960 / 5 cores per executor (rough arithmetic below). Am I focusing on the wrong thing? Is there any other explanation for the bad performance of spark-submit vs. spark-shell?
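Here is that back-of-the-envelope arithmetic spelled out, using only the numbers from this post (nothing Spark-specific):

# Numbers from this post: 10 machines, 96 cores each, 5 cores per executor.
machines = 10
cores_per_machine = 96
executor_cores = 5

total_cores = machines * cores_per_machine              # 960 cores in the cluster
max_executors = total_cores // executor_cores           # 192 executors at most
max_concurrent_tasks = max_executors * executor_cores   # 960 task slots

print(total_cores, max_executors, max_concurrent_tasks)
# -> 960 192 960: ~190 executors and ~950 concurrent tasks seem to describe the same
#    allocation, one counted in executors and the other in task slots.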