I have trouble understanding my results for an integral algorithm implemented in OpenCL. I have access to two Intel Xeon E5-2680 v3 CPUs, each with 12 cores.
For some reason OpenCL reports only one device, but I can request either 12 or 24 cores, so I assume it doesn't matter that I "see" one device instead of two, as long as all 24 cores (both CPUs) are actually used.
I ran the tasks with a local size of 4096 (the maximum) and a minimal global size of 4096. At that size the execution time was the same for 1 CPU and 2 CPUs, and it stayed the same as I increased the global size to 2×4096, 4×4096, and 8×4096. Once I reached a global size of 16×4096, the 1-CPU runs started slowing down while the 2-CPU runs kept speeding up, and for every larger global size after that the pattern held: 2 CPUs were about 2× faster than 1 CPU.
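For context on the configuration above: in OpenCL, the number of work-groups the runtime can distribute across compute units is global size divided by local size, so with the local size fixed at 4096 the smaller global sizes produce only a handful of work-groups. A quick back-of-envelope sketch (sizes taken from my runs, the device simply being assumed to have 24 compute units):

```python
# Work-group count = global_size / local_size (OpenCL NDRange rule).
# With local size pinned at 4096, small global sizes yield far fewer
# work-groups than there are compute units to run them on.
LOCAL_SIZE = 4096
COMPUTE_UNITS = 24  # assumption: 2 x 12-core CPUs seen as one device

for mult in (1, 2, 4, 8, 16, 32):
    global_size = mult * LOCAL_SIZE
    groups = global_size // LOCAL_SIZE
    print(f"global = {mult}*4096 -> {groups} work-group(s) "
          f"for {COMPUTE_UNITS} compute units")
```

So at global size 8×4096 there are only 8 work-groups in flight, and the crossover I observe happens right around the point where the work-group count first reaches the core count of a single CPU.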
I don't understand why the 2-CPU configuration shows no advantage over 1 CPU at the smaller sizes. Another thing that matters to me: I was collecting power consumption for the CPUs. At global size = 8×4096, where the execution times are still equal, I see slightly lower power consumption for 2 CPUs. As the global size grows further, the 2-CPU power consumption stays below the 1-CPU figure, I guess because of the 2× faster execution, but shouldn't it be equal to or higher than 1 CPU? Possibly relevant: I verified that in both the 1-CPU and 2-CPU runs the frequency stays fixed at 2.5 GHz. My questions are:
Why do 1 CPU and 2 CPUs have equal execution times at smaller global sizes?
Why do 2 CPUs have lower power consumption at larger global sizes?
Why, at the one point (global size = 8×4096) where the execution times are equal, do 2 CPUs draw slightly less power than 1 CPU?
I should add that every run was repeated 10×, so these results are not accidental.