
I have an Apache Spark 1.6.1 standalone cluster set up on a single machine with the following specifications:

CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
RAM: 16GB

If I have the following configuration:

SPARK_WORKER_INSTANCES = 1
SPARK_WORKER_CORES = 3
SPARK_WORKER_MEMORY = 14GB
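
For completeness, this is roughly how I launch an application against that configuration and check how many task slots Spark actually grants. The master URL, app name and the spark.cores.max cap are only illustrative assumptions, not part of my real setup:

    import org.apache.spark.{SparkConf, SparkContext}

    object CoreCheck {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("spark://localhost:7077") // assumed master URL of the standalone cluster
          .setAppName("core-check")
          .set("spark.cores.max", "3")         // cap the application at the 3 worker cores above
        val sc = new SparkContext(conf)

        // Number of task slots Spark schedules on by default for this application
        println(s"defaultParallelism = ${sc.defaultParallelism}")

        sc.stop()
      }
    }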

My questions are:

(A) Is my job using:

  • 3 physical cores for the workers, 1 physical core for the driver: 4 physical cores in total?
  • 2 physical cores and 1 vcore for the workers, 1 physical core for the driver: 3 physical cores in total?
  • 2 physical cores and 1 vcore for the workers, 1 vcore for the driver: 2 physical in total?
  • Any other combination of allocated vcores and physical ones?

(B) Is there a way to set Spark so that it uses only physical cores first and, only if I need more than the physical ones, then falls back to vcores?

(C) Is there a way to know whether Spark is using physical cores or vcores? (A small sketch of what I can already inspect follows these questions.)

(D) Is there an official place where I can find information about Spark's behaviour with regard to physical and virtual cores?
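
Regarding (C), this is what I can already inspect from the JVM and the OS. A minimal, Linux-only sketch (reading /proc/cpuinfo is my own workaround, not something provided by Spark):

    import scala.io.Source

    object CpuTopology {
      def main(args: Array[String]): Unit = {
        val info = Source.fromFile("/proc/cpuinfo").getLines().toList

        // Each "processor" entry is one logical CPU (8 on a 4-core / 8-thread i7-4790)
        val logical = info.count(_.startsWith("processor"))

        // Distinct "core id" values give the physical cores on a single-socket machine (4 here)
        val physical = info.filter(_.startsWith("core id")).distinct.size

        println(s"logical CPUs   = $logical")
        println(s"physical cores = $physical")

        // The JVM (and therefore Spark) only ever reports the logical count
        println(s"availableProcessors = ${Runtime.getRuntime.availableProcessors}")
      }
    }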

Thanks a lot.

User2130
  • https://youtu.be/7ooZ4S7Ay6Y?t=1h30m30s – zero323 Jul 14 '16 at 20:52
  • Yes, for local mode it is threads when setting local[n] / local[*] / local[1]. For standalone it is CPU cores when talking about cores. – User2130 Jul 14 '16 at 21:03
  • You can check the same video for information about standalone, for example (somewhere around 1:48 if my notes are correct). – zero323 Jul 14 '16 at 21:13
  • 1:49 – "SPARK_WORKER_CORES all that means is how many cores a worker JVM can give out to its underlying executors." (One executor per application by default on standalone.) The answer to this sub-issue, I believe, is: even if it is a thread at a lower level, in standalone mode it reserves a whole CPU core for its thread, whereas in local mode we can have more "threads" than CPU cores. So in the end, I would like to know which kind of core it is using (vcore or physical)? BTW, thanks a lot for the YouTube video, it is hugely useful in general. I loved it. – User2130 Jul 14 '16 at 21:40
  • It doesn't. Assigning CPU shares is the OS's job, and AFAIK even with cgroups you cannot get finer granularity than a single process. But I don't expect you to believe me :) The video, although slightly outdated, is extremely useful. – zero323 Jul 14 '16 at 21:51
  • You seem to think that when using hyper-threading, one half of the cores is faster than the other half. This is not the case; all logical cores appear the same to the OS. – Kien Truong Jul 14 '16 at 21:58
  • @Dikei I did not think of it at the time, but your comment made me dig deeper into reading about hyper-threading and computer architecture. And I realised, as you commented, that vcores and physical cores are all the same to the OS and to programs, since HT technology is transparent to them. Thanks a lot. – User2130 Jul 20 '16 at 21:15
  • @zero323 OK, I checked and tested what you said and you are right. "Cores" in Spark means how many tasks an executor can handle simultaneously. I really don't understand why Spark's documentation is not clear about this. They even mention "CPU cores" many times on the official site, which makes beginners like me misunderstand. I monitored the CPU usage on my PC and, when a Spark application is running, all cores are in fact utilised regardless of how many "cores" I set with SPARK_WORKER_CORES. Thank you very much, you motivated me to investigate this further and I have finally understood. – User2130 Jul 20 '16 at 21:19
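
For anyone who lands here later, a rough sketch of the kind of job I used to confirm the "cores = task slots" behaviour discussed in the comments above (the sleep-based job and app name are only illustrative, not my real workload):

    import org.apache.spark.{SparkConf, SparkContext}

    object SlotDemo {
      def main(args: Array[String]): Unit = {
        // master URL supplied externally, e.g. via spark-submit --master spark://...
        val sc = new SparkContext(new SparkConf().setAppName("slot-demo"))

        // 8 tasks that each sleep for 10 seconds. With SPARK_WORKER_CORES = 3 the stage
        // runs in roughly three waves (3 + 3 + 2 tasks) on the :4040 UI timeline, even
        // though the OS is free to spread the threads across all 8 logical CPUs.
        sc.parallelize(1 to 8, numSlices = 8).foreach(_ => Thread.sleep(10000))

        sc.stop()
      }
    }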

0 Answers