
I am trying to understand simultaneous multithreading (SMT) but I have just run into a problem.

Here is what I have figured out so far:
SMT requires a superscalar processor to work. Both technologies - superscalar execution and SMT - allow multiple instructions to be executed at the same time. While a "simple" superscalar processor requires all instructions issued within one cycle to belong to a single thread, SMT allows instructions from different threads to execute at the same time. SMT is advantageous over plain superscalar execution because the instructions of a single thread often have dependencies, meaning that we cannot execute them all at the same time. Instructions from different threads do not have these dependencies, allowing us to execute a larger number of instructions at the same time.
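To make this concrete for myself, I wrote a tiny sketch (the function name and the model are my own invention, not from any real CPU): if every instruction of a thread depends on the previous one, a single thread can only supply one ready instruction per cycle, so a 4-wide issue stage is mostly idle; with several independent threads, the slots fill up.

```python
# Toy model: how many instructions can a 4-wide issue stage start per
# cycle, given that each thread's internal dependency chain limits it to
# one ready instruction per cycle.

ISSUE_WIDTH = 4

def issued_per_cycle(ready_threads):
    """Each thread contributes at most one ready instruction per cycle,
    so issue throughput is capped by both width and thread count."""
    return min(ISSUE_WIDTH, ready_threads)

print(issued_per_cycle(1))  # single thread: only 1 of 4 slots used
print(issued_per_cycle(4))  # 4 SMT threads: all 4 slots used
```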

I hope I got that right so far.

Here is the problem I have

SMT is said to be a mix of superscalar processors and interleaved / temporal multithreading. Personally, I cannot see how interleaved multithreading is involved in SMT.

Interleaved multithreading is better than no multithreading. That's because interleaved multithreading allows a context switch when high latency events (e.g. cache misses) occur. While the data is being loaded into the cache, the processor can carry on with a different thread, which increases performance.

I wonder if SMT also makes use of interleaved multithreading. Or, to put it as a question: what happens when high latency events occur in an SMT architecture?

Example of what I was thinking of

Let's assume we have a 4-way superscalar SMT processor and there are 5 different threads waiting to be executed. Let's also assume that each instruction of each thread depends on the previous one, so that only one instruction per thread can be executed at a time.

If there aren't any high latency events, I figure the execution of the instructions could look something like this (each number and color corresponds to a thread):

Without cache miss

We would just be executing one instruction from each of the first 4 threads every cycle, making ideal use of the processor. Thread 5 simply has to wait until another thread finishes.
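The schedule I have in mind can be sketched as follows (my own toy model, not taken from any real CPU): 4 issue slots per cycle rotate over the 5 threads, so exactly one thread waits each cycle.

```python
# Toy schedule for the example: 5 threads, 4-wide issue, each thread
# supplies at most one instruction per cycle due to its dependency chain.
THREADS, WIDTH = 5, 4

def slots_for(cycle):
    """Thread ids occupying the 4 issue slots in a given cycle,
    rotating round-robin over the 5 threads."""
    return [(cycle * WIDTH + s) % THREADS for s in range(WIDTH)]

for cycle in range(3):
    print(cycle, slots_for(cycle))
# cycle 0: threads 0-3 issue, thread 4 waits
# cycle 1: thread 4 gets a slot, thread 3 waits, and so on
```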

What I really want to figure out is what happens if a high latency event occurs. Let's assume the situation is the same, but this time thread 1 runs into a cache miss at its first instruction. What happens could look something like this:

Cache miss without multithreading

We would have to wait until the data is loaded from memory - unless we additionally use interleaved multithreading, such as block interleaving with switch-on-cache-miss. Then it could look like one of these:

Cache miss with multithreading

I only found pictures which might suggest that SMT uses some kind of fine-grained multithreading, but I couldn't find any information that really confirms this.

I would be really thankful if someone could help me to understand how this part of SMT works. This detail is driving me crazy!

Peter Cordes
Jan

1 Answer


The term SMT usually refers to out-of-order (OoO) processors. OoO processors already have all the machinery for handling dependencies between instructions, a physical register file that is a lot larger than the architectural register file, and so forth. In such a processor, adding SMT is relatively simple: essentially, the processor just needs support for the extra per-thread architectural state, and then to tag each instruction with the HW thread it belongs to; after that, the OoO execution machinery handles all the queued instructions just like before. So the OoO machinery handles dependencies between instructions as usual, handles instructions which are waiting for e.g. a cache miss, and so on. Instructions from different (or the same) threads are free to execute on whichever execution pipelines are free and able to execute them, regardless of which thread they belong to, subject to all their dependencies having been satisfied. And yes, instructions from multiple threads can execute concurrently on a superscalar core.
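A rough sketch of that point (the names and numbers are invented for illustration, not a model of any specific core): the scheduler picks any ready micro-op, tagged with its HW thread, so a thread stalled on a cache miss simply leaves its issue slots to other work.

```python
from collections import namedtuple

# A micro-op tagged with its HW thread and the cycle its operands are ready.
Uop = namedtuple("Uop", "thread name ready_cycle")

def schedule(uops, width, cycles):
    """Greedy OoO-style issue: each cycle, pick up to `width` uops whose
    operands are ready, regardless of which thread they belong to."""
    pending = list(uops)
    trace = []
    for cycle in range(cycles):
        ready = [u for u in pending if u.ready_cycle <= cycle]
        issued = ready[:width]
        for u in issued:
            pending.remove(u)
        trace.append([(u.thread, u.name) for u in issued])
    return trace

# Thread 0's load misses the cache and is not ready until cycle 3;
# threads 1-3 keep the 4-wide core busy in the meantime.
uops = [Uop(0, "load(miss)", 3)] + [Uop(t, f"op{t}", 0) for t in (1, 2, 3)]
for cycle, issued in enumerate(schedule(uops, width=4, cycles=4)):
    print(cycle, issued)
```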

Interleaved multithreading, OTOH, is what you'll find in in-order processors. These processors lack all the sophisticated (and power hungry!) OoO logic, so they must do something simpler, unless they want to morph into OoO processors with all the costs that entails. Thus they choose a simpler form of multithreading, where at each point in time the processor only executes instructions from a single thread. So even if the processor is superscalar, at any point in time it can only issue instructions belonging to a single thread. After some time (in some processors as often as every cycle) the processor switches to another thread. And if a thread is blocked, e.g. waiting for a cache miss to be resolved, the processor bypasses that thread and runs the other threads. From the OS perspective all the threads appear to be running simultaneously, because this switching happens at a much finer granularity than OS scheduling.
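That thread-selection policy can be sketched like this (a hypothetical model, not any particular core): round-robin over the hardware threads, skipping any thread that is blocked on a miss until it becomes runnable again.

```python
# Fine-grained interleaving on an in-order core: each cycle the core runs
# one thread; a thread blocked on a cache miss is skipped until it resolves.

def interleave(n_threads, blocked_until, cycles):
    """Round-robin thread selection, skipping blocked threads.
    blocked_until maps thread id -> first cycle it is runnable again."""
    order = []
    t = 0
    for cycle in range(cycles):
        for _ in range(n_threads):  # scan for the next runnable thread
            if blocked_until.get(t, 0) <= cycle:
                order.append(t)
                t = (t + 1) % n_threads
                break
            t = (t + 1) % n_threads
        else:
            order.append(None)  # every thread is stalled this cycle
    return order

# Thread 0 misses the cache and is blocked until cycle 4;
# threads 1 and 2 keep the core busy in the meantime.
print(interleave(3, {0: 4}, cycles=6))
```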

janneb
  • Thank you for your answer. The combination of SMT and OoO makes sense to me. Yet I still don't understand what happens when e.g. a cache miss occurs. Do we just accept the loss of performance or does the scheduler / issue logic cancel this operation in order to execute another instruction in the meantime? I think of something like using coarse grain multithreading (block interleaving) on top of SMT and OoO. Otherwise the instruction with the cache miss just blocks at least one of the execution units for no reason. Also it would block all issued instructions having a real dependency on it. – Jan Jun 13 '23 at 19:01
  • @Jan Modern high end OoO processors have instruction window sizes in the hundreds, do extensive speculation with branch prediction, value speculation and whatnot. As for an instruction that blocks on a cache miss, AFAIU such instructions are split into micro-ops, and the stalled load goes to the memory unit which is capable of hundreds of outstanding loads. And only once all the dependencies have been satisfied is an instruction submitted for execution. – janneb Jun 13 '23 at 19:42
  • OoO is not necessary, e.g., Niagara is a massively SMT architecture, yet it's a simple in-order. – SK-logic Jun 14 '23 at 15:07
  • @SK-logic: Niagara isn't an SMT processor per the terminology used in this answer, it's an implementation of interleaved multithreading. It switches threads every cycle, and instructions from multiple threads cannot be issued in the same clock cycle. – janneb Jun 14 '23 at 16:29
  • @janneb you're right, I for some reason assumed Niagara was dual-issue in-order, while it's apparently single-issue. Anyway, it's quite possible to imagine a wide in-order superscalar architecture that would issue multiple threads in a single clock cycle (and I built such CPUs in the past, as a proof of concept, with up to 8 threads issued/retired in one cycle). – SK-logic Jun 15 '23 at 08:42
  • @SK-logic Could you please tell me if you also used some kind of temporal multithreading within your concept? I could think of something like block interleaving. Whenever there's a cache miss (or similar) there could be a "context switch" to use the time needed for loading the information into the cache more efficiently. I don't see any reason why SMT (without OoO) and temporal multithreading shouldn't be combined. – Jan Jun 16 '23 at 05:42
  • @Jan the core I built was meant for compute acceleration, geared towards an OpenCL-like model, with throughput being far more important than latency, so it basically context switches all the time - whatever instructions come into the scheduler queue first get served, mixing all of the running threads. But you're right, it is indeed possible to switch SMT threads on an in-order architecture on pipeline stalling events - having more than one execution unit, with one of them stalled, can be a hint to the scheduler to start issuing the next thread. – SK-logic Jun 16 '23 at 06:30