
Currently, I am learning parallel processing using CPU, which is a well-covered topic with plenty of tutorials and books.

However, I could not find a single tutorial or resource that talks about programming techniques for hyper-threaded CPUs. Not a single code sample.

I know that to utilize hyper threading, the code must be implemented such that different parts of the CPU can be used at the same time (the simplest example is performing integer and floating-point calculations at the same time), so it's not plug-and-play.

Which book or resource should I look at if I want to learn more about this topic? Thank you.

EDIT: when I said hyper threading, I meant Simultaneous Multithreading in general, not Intel's hyper threading specifically.

Edit 2: for example, if I have an i7 8-core CPU, I can make a sorting algorithm that runs 8 times faster when it uses all 8 cores instead of 1. But it will run the same on a 4c-4t CPU and a 4c-8t CPU, so in that case SMT does nothing.

Meanwhile, Cinebench will run much better on a 4c-8t CPU than on a 4c-4t CPU.

Duke Le
  • What you are searching for is not code that utilizes hyper threading, which is a marketing term for [SMT](https://en.wikipedia.org/wiki/Simultaneous_multithreading), but for multithreading. There are a lot of resources on that topic, but I personally don't know a good one to suggest, so I won't make an answer out of this. You should probably edit your question to ask for multithreading instead, so you get useful answers. – eike Dec 17 '19 at 07:42
  • Yes, what I meant by hyper threading is SMT in general, and not Intel's hyper threading. I will edit that. – Duke Le Dec 17 '19 at 07:43
  • No, you misunderstood: Hyper threading (SMT) is a technique used by CPU manufacturers to allow the execution of more than one thread **per CPU core**. Multithreading just generally means running multiple threads at the same time, which is what you are searching for. SMT is not something you can interact with when programming, it just increases the number of threads the CPU can execute at the same time. – eike Dec 17 '19 at 07:47
  • To add to @eike's comment, you can interact with SMT only indirectly... You can structure your code in a way that allows SMT to perform better, but you can't tell the CPU _how_ to "use" SMT on your code. The situation is slightly similar to caches: You can't explicitly load data into your caches, but you can restructure your code in a way that will allow caches to be better filled. – andreee Dec 17 '19 at 07:52
  • Yes my question is exactly about how to "structure your code in a way that allows SMT to perform better", not explicitly using SMT – Duke Le Dec 17 '19 at 07:58
  • @eike "SMT is not something you can interact with when programming" – could you clarify that? Do you mean I should program as if SMT doesn't exist? (For example, code as if a 4c-8t CPU were an 8c-8t CPU.) – Duke Le Dec 17 '19 at 08:11
  • _"I should program as if SMT doesn't exist?"_ For the usual(TM) multithreading application, yes. When optimizing later, you should definitely take SMT into consideration. Sorry I can't give an elaborate answer to this right now, it's been quite a while since I worked with this kind of stuff :-) – andreee Dec 17 '19 at 08:15
  • Well, that makes things a lot simpler. However, the question is still about "how" and "book or resources to learn about the topic", so if anyone can give an answer to that I would very much appreciate it. – Duke Le Dec 17 '19 at 08:19
  • @user3192711 one last comment from my side: I attended [this very good three-day course](https://moodle.rrze.uni-erlangen.de/course/view.php?id=387) a couple of years ago, it's about node-level performance engineering. On day 2 they discuss SMT (with a focus on HPC applications), maybe it can give you a pointer. – andreee Dec 17 '19 at 08:31

1 Answer


SMT is generally most effective when one thread is loading something from memory. Depending on where the data resides (L1, L2, or L3 cache, or RAM), the read/write latency can span many CPU cycles that would be wasted doing nothing if only one thread were executing per core.

So, if you want to maximize the impact of SMT, try to interleave the memory accesses of two threads so that one of them can execute instructions while the other reads data. Theoretically, you can also use a thread just for cache warming, i.e. loading data from RAM or main storage into the cache for subsequent use by other threads.

How to apply this successfully can vary from one system to another, because the access latencies of cache, RAM and main storage, as well as their sizes, can differ by a lot.

eike
  • "one of them can execute instructions, while the other reads data" this is the type of information I'm looking for in the answers. Do you know any books or resources online that talk more about this? – Duke Le Dec 17 '19 at 08:32
  • @user3192711 Sadly none that are public. I know the basic principle of SMT from university lectures and have deduced the rest from that. I am sure there are a lot of resources on the general topic, but I don't have any to recommend. When optimizing for performance, [cache locality](https://en.wikipedia.org/wiki/Locality_of_reference#Hierarchial_memory) usually has a relatively big impact without changing the basic idea of the algorithms used. Since that also reduces time wasted waiting for data and is more easily understood and adjusted for, it may be better to look for that first. – eike Dec 17 '19 at 08:42