12

Say I run a simple single-threaded process like the one below:

public class SirCountALot {
    public static void main(String[] args) {
        int count = 0;
        while (true) {
            count++;
        }
    }
}

(This is Java because that's what I'm familiar with, but I suspect it doesn't really matter.)

I have an i7 processor (4 cores, or 8 counting hyperthreading), and I'm running Windows 7 64-bit, so I fired up Sysinternals Process Explorer to look at the CPU usage. As expected, I see it using around 20% of all available CPU:

Graph showing 20% CPU usage across all cores

But when I toggle the option to show 1 graph per CPU, I see that instead of 1 of the 4 "cores" being used, the CPU usage is spread all over the cores:

Graph showing erratic CPU usage on each core totaling around 20% usage

What I would expect instead is 1 core maxed out, but this only happens when I set the affinity for the process to a single core:

Graph showing most of recent CPU usage to be confined to first core

Why is the workload split over the separate cores? Wouldn't splitting the workload over several cores mess with the caching or incur other performance penalties?

Is it for the simple reason of preventing overheating of one core? Or is there some deeper reason?

Edit: I'm aware that the operating system is responsible for the scheduling, but I want to know why it "bothers". Surely from a naive viewpoint, sticking a (mostly*) single-threaded process to 1 core is the simpler & more efficient way to go?

*I say mostly single-threaded because there are multiple threads here, but only 2 of them are doing anything:

Screenshot showing number of threads from Eclipse
Screenshot showing number of threads in Process Explorer process properties
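For reference, the same thing can be checked from inside the JVM without a profiler. This is just an illustrative sketch using the standard ThreadMXBean API (the exact thread names and states you see will vary by JVM):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ListThreads {
    public static void main(String[] args) {
        // Dump every live JVM thread with its current state. In a program like
        // SirCountALot, typically only the main thread is RUNNABLE; the rest are
        // JVM housekeeping threads (GC, finalizer, signal dispatcher, ...) that
        // spend almost all of their time idle.
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : bean.getThreadInfo(bean.getAllThreadIds())) {
            if (info != null) {
                System.out.println(info.getThreadName() + " - " + info.getThreadState());
            }
        }
    }
}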

Caspar
  • Small nitpick; saying this is a single-threaded process isn't quite correct. The JVM internally spawns multiple threads for housekeeping purposes like finalizers, garbage collection, etc. It is quite possible that the JVM maps its threads to real h/w threads to get their work done, which again might explain the spread. – Sanjay T. Sharma Dec 13 '11 at 08:10
  • I guess Caspar meant the _non-daemon_ threads. – Santosh Dec 13 '11 at 08:17
  • @SanjayT.Sharma Yes, I simplified a little bit and probably should have given a sample program in a non-managed language ;) However like I said, I strongly suspect it isn't the JVM doing this (and if it is mapping JVM -> HW threads and that is responsible, why is the mapping constantly changing?) – Caspar Dec 13 '11 at 08:30
  • @Santosh yes exactly, I meant threads which aren't idle 99% of the time – Caspar Dec 13 '11 at 08:31

2 Answers

20

The OS is responsible for scheduling. It is free to stop a thread and start it again on another CPU. It will do this even if there is nothing else the machine is doing.

The process is moved around the CPUs because the OS doesn't assume there is any reason to continue running the thread on the same CPU each time.

For this reason I have written a library to lock threads to a CPU so they won't move around and won't be interrupted by other threads. This reduces latency and improves throughput, but it does tie up a CPU for that thread. It works on Linux; perhaps you can adapt it for Windows. https://github.com/peter-lawrey/Java-Thread-Affinity/wiki/Getting-started
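A minimal sketch of what using it looks like, in the spirit of the question's counter loop. The AffinityLock class name and import path here are taken from later versions of the library and may differ from the version on that wiki page, so treat this as an illustration rather than the exact API:

import net.openhft.affinity.AffinityLock; // package name is an assumption; older releases use a different one

public class PinnedCounter {
    public static void main(String[] args) {
        // Reserve a free core for the current thread so the scheduler keeps it there.
        AffinityLock lock = AffinityLock.acquireLock();
        try {
            long count = 0;
            while (count < 10_000_000_000L) { // bounded busy loop, same spirit as SirCountALot
                count++;
            }
            System.out.println("Done: " + count);
        } finally {
            lock.release(); // give the core back to the OS / other threads
        }
    }
}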

Peter Lawrey
  • The OS interrupts a process many times per second (100/s on Linux). It is more work to remember where a process was last running and try to assign it that CPU in preference to any other; instead, it assigns the process to the next free CPU. – Peter Lawrey Dec 13 '11 at 08:23
  • I guess I wasn't clear enough; I'm aware that the OS does the scheduling, and you can see in the 2nd graph I've set the affinity for the process so it uses only the first core. What I want to know is *why* the OS schedules the single "active" thread over all the available cores. – Caspar Dec 13 '11 at 08:23
  • The converse question is: why keep assigning a thread back to the same CPU instead of just assigning it to the next free CPU (which is what it does)? Using round robin works well no matter how many CPUs are busy. Assigning to the same CPU each time could leave one CPU very busy (with two threads running on it) while other CPUs are idle. – Peter Lawrey Dec 13 '11 at 08:25
  • Not really off-topic but a "side-topic" question: what are the use cases where you really need something like that? Financial trading applications? What did you need it for, or was it just a private case study project? – Fabian Barney Dec 13 '11 at 08:33
  • @Peter Okay (my previous comment was written before your first comment). I thought that since processor time was cheap & memory access relatively expensive, wouldn't the performance hit from a cache miss be worse than that from keeping track of which core to schedule a process on? – Caspar Dec 13 '11 at 08:35
  • The scheduler is designed to spread the workload around, which makes sense for a machine that has more active threads than CPUs. In the case where you have more cores than critical threads, that's a good reason to use affinity for those threads you have identified as critical (which is why I wrote a library to do it). AFAIK, you can't stop the threads from being interrupted (but you can reduce it). – Peter Lawrey Dec 13 '11 at 08:38
  • I've noticed this behaviour too. It's very easy to just not do this - 'if the thread was running before the interrupt, and it is decided to run it after the interrupt, run it on the same core' - so it's a deliberate design choice. I've put it down to an attempt to ensure cache coherence across all the cores, at least in the long term (i.e. hundreds of ms), for those apps that fail to implement barriers correctly. Has anyone measured, estimated or seen any papers on the cost of this strategy? – Martin James Dec 13 '11 at 12:18
  • I can't think of any reason to 'spread the workload around' where there are more ready threads than cores, and it would be very easy for the scheduler to determine that the number of ready threads is less than the number of cores and so not move the ready threads around. There is a reason for moving these thread/s around; I just don't know for sure what it is yet. – Martin James Dec 13 '11 at 12:28
  • There is a good reason to keep a continuously ready-to-run thread on the same core as much as possible -- everything that thread needs is hot on the core that's running that thread. That includes the L1 and L2 caches, the branch prediction buffer, the TLB, and so on. Changing cores means having to repopulate all those caches. (This really only applies if the thread is always ready to run. If the core does something else because that thread couldn't keep running, the caches will be cold even if you assign the thread back to the same core.) – David Schwartz Dec 15 '16 at 17:20
1

I would also expect this could well be done on purpose by the CPU and OS so as to try and spread the thermal load on the CPU die...

So it would rotate the (unique/single) thread from core to core.

And that could admittedly be an argument against trying to fight this too hard (especially as, in practice, you will often see better improvements by simply tuning / improving the app itself anyway).

Camlin
  • Interesting. Do you know that Windows/Linux do this for sure, or is it a hypothesis? (Also, welcome to stackoverflow : ) – Leeor Apr 02 '16 at 08:13
  • I have seen this clearly happening on OSX and Windows. I would expect the same for Linux but never specifically tried to verify it. – Camlin Apr 26 '16 at 20:40