I am trying to understand simultaneous multithreading (SMT) but I have just run into a problem.
Here is what I figured out so far
SMT requires a superscalar processor to work. Both technologies - superscalar execution and SMT - allow multiple instructions to be executed at the same time. Whilst a "simple" superscalar processor requires all instructions issued within one cycle to belong to a single thread, SMT allows instructions from different threads to be executed at the same time. SMT is advantageous over a plain superscalar processor because the instructions of a single thread often have dependencies, meaning we cannot execute them all at the same time. Instructions from different threads do not have these dependencies, allowing us to execute a larger number of instructions at the same time.
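To make sure I am picturing the difference correctly, here is a toy model in Python of how I imagine the issue logic (the `Instr` class, the "pick the best thread" policy and all names are my own simplifications, not how real hardware works):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str
    deps: tuple = ()            # instructions this one depends on

ISSUE_WIDTH = 4                 # 4 issue slots per cycle

def ready(thread, done):
    """Instructions of a thread whose dependencies have already completed."""
    return [i for i in thread if i not in done and all(d in done for d in i.deps)]

def superscalar_issue(threads, done):
    """'Simple' superscalar: all of this cycle's slots come from one thread."""
    best = max(threads, key=lambda t: len(ready(t, done)))
    return ready(best, done)[:ISSUE_WIDTH]

def smt_issue(threads, done):
    """SMT: this cycle's slots may be filled from several threads."""
    slots = []
    for thread in threads:
        for instr in ready(thread, done):
            if len(slots) == ISSUE_WIDTH:
                return slots
            slots.append(instr)
    return slots

# Two threads whose instructions form dependency chains: each thread can only
# offer one ready instruction per cycle, so SMT fills more slots than a single
# thread ever could.
a1 = Instr("a1"); a2 = Instr("a2", (a1,))
b1 = Instr("b1"); b2 = Instr("b2", (b1,))
threads = [[a1, a2], [b1, b2]]
print([i.name for i in superscalar_issue(threads, done=set())])   # ['a1']
print([i.name for i in smt_issue(threads, done=set())])           # ['a1', 'b1']
```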
I hope I got that right so far.
Here is the problem I have
SMT is said to be a mix of superscalar execution and interleaved / temporal multithreading. Personally, I cannot see how interleaved multithreading is involved in SMT.
Interleaved multithreading is better than no multithreading. That's because interleaved multithreading allows a context switch when a high latency event (e.g. a cache miss) occurs. Whilst the data is being loaded into the cache, the processor can carry on with a different thread, which increases performance.
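Just to check that I understand this baseline, here is my toy model of the switch-on-cache-miss behaviour described above (the latency value and the way the next thread is chosen are made-up assumptions on my part):

```python
# Toy model of block interleaving with switch-on-cache-miss: one thread runs
# until it misses in the cache, then the processor switches to another thread
# while the miss is being served.  Latency and switch policy are made up.

MISS_LATENCY = 10          # made-up number of cycles until the data arrives

def run_block_interleaved(threads):
    """threads: list of instruction lists; 'load-miss' marks a cache miss."""
    waiting = {}           # thread index -> cycle at which its miss is served
    current, cycle = 0, 0
    while any(threads):
        waiting = {t: c for t, c in waiting.items() if c > cycle}
        if current in waiting or not threads[current]:
            runnable = [t for t, ins in enumerate(threads)
                        if ins and t not in waiting]
            if not runnable:           # everyone is waiting on memory
                cycle += 1
                continue
            current = runnable[0]      # context switch
        instr = threads[current].pop(0)
        if instr == "load-miss":
            waiting[current] = cycle + MISS_LATENCY
        cycle += 1
    return cycle

print(run_block_interleaved([["load-miss", "add"], ["add", "add", "add"]]))
```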
I wonder if SMT also makes use of interleaved multithreading. Or, to put it as a question: what happens when high latency events occur in an SMT architecture?
Example of what I was thinking of
Let's assume we have a 4-way superscalar SMT processor and there are 5 different threads waiting to be executed. Let's also assume that each instruction of a thread depends on the previous instruction of that thread, so that only one instruction per thread can be executed at a time.
If there aren't any high latency events, I figure the execution of the instructions could look something like this (each number and color corresponds to a thread):
We would just keep issuing one instruction from each of the first 4 threads every cycle, using the processor fully. Thread 5 just has to wait until another thread is finished.
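In case the missing picture is unclear, this little loop reproduces the schedule I tried to draw (the instruction counts and the fixed priority order are assumptions I made up to keep it simple):

```python
# 4 issue slots, 5 threads, and (because of the dependencies) at most one
# instruction per thread per cycle.

SLOTS = 4
instructions_left = {1: 3, 2: 3, 3: 3, 4: 3, 5: 3}   # made-up counts

cycle = 0
while any(instructions_left.values()):
    slots = []
    for tid in sorted(instructions_left):             # static priority 1..5
        if instructions_left[tid] > 0 and len(slots) < SLOTS:
            slots.append(tid)
            instructions_left[tid] -= 1
    print(f"cycle {cycle}: slots taken by threads {slots}")
    cycle += 1
```

With these numbers, threads 1-4 fill all four slots for the first three cycles and thread 5 only gets in afterwards, which is what I meant above.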
What I really want to figure out is what happens if a high latency event occurs. Let's assume the situation is the same, but this time thread 1 runs into a cache miss at its first instruction. What happens could look something like this:
(figure: cache miss without multithreading)
We would have to wait until the data is loaded from memory, unless we additionally use interleaved multithreading, such as block interleaving with switch-on-cache-miss. Then it could look like one of these:
(figure: cache miss with multithreading)
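For example, one of the variants I was imagining could be written like this (again, the miss latency and the "skip the stalled thread" policy are pure guesses on my part):

```python
# Thread 1 misses on its first instruction and is simply skipped by the issue
# logic until the miss is served, so its slot can be given to thread 5 in the
# meantime.  Everything here is a made-up assumption, not a real design.

SLOTS = 4
MISS_LATENCY = 5
instructions_left = {1: 3, 2: 3, 3: 3, 4: 3, 5: 3}
stalled_until = {1: MISS_LATENCY}        # thread 1 waits for its cache miss

cycle = 0
while any(instructions_left.values()):
    slots = []
    for tid in sorted(instructions_left):
        blocked = stalled_until.get(tid, 0) > cycle
        if instructions_left[tid] > 0 and not blocked and len(slots) < SLOTS:
            slots.append(tid)
            instructions_left[tid] -= 1
    print(f"cycle {cycle}: slots taken by threads {slots}")
    cycle += 1
```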
I have only found pictures which might suggest that SMT uses some kind of fine-grained multithreading, but I couldn't find any information to really confirm this.
I would be really thankful if someone could help me to understand how this part of SMT works. This detail is driving me crazy!