How does the size of a binary influence its execution speed? Specifically, I am talking about code written in ANSI C and translated into machine language with the GNU or Intel compilers. The target platforms are modern computers with Intel or AMD multi-core CPUs running a Linux operating system. The code performs numerical computations, possibly in parallel using OpenMP, and the binary can be several megabytes in size.
Note that the execution time will in any case be much larger than the time needed to load the code and its libraries. I am thinking of specific codes used to solve large systems of ordinary differential equations arising in simulations of kinetic equations; these are typically CPU-bound for moderate system sizes but can also become memory-bound.
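To make the setting concrete, here is a minimal sketch (with hypothetical names) of the kind of kernel I mean: an OpenMP-parallel right-hand-side evaluation for a large ODE system. For small `n` the loop is CPU-bound; once `y` and `dydt` no longer fit in the caches it becomes memory-bound.

```c
#include <stddef.h>

/* Right-hand side of a large ODE system with simple nearest-neighbour
   coupling, evaluated in parallel with OpenMP. */
void rhs(size_t n, const double *y, double *dydt, double k)
{
    long i;
    #pragma omp parallel for
    for (i = 1; i < (long)n - 1; ++i) {
        dydt[i] = k * (y[i - 1] - 2.0 * y[i] + y[i + 1]);
    }
}
```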
I am asking whether small binary size should be a design criterion for highly efficient code, or whether I can always give preference to explicit code (which may repeat code blocks that could instead be factored into functions) and to compiler optimizations such as loop unrolling. A sketch of the trade-off I have in mind follows below.
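As a hypothetical illustration of that trade-off (names and the unroll factor are my own, not from any particular code): the same update written once as a small helper versus written out explicitly. The explicit variant duplicates code and enlarges the binary, but hands the compiler a straight-line body to unroll and vectorize.

```c
#include <stddef.h>

/* Variant A: compact, factored into a helper function. */
static double decay(double y, double k) { return -k * y; }

void step_compact(size_t n, double *y, const double *k, double dt)
{
    size_t i;
    for (i = 0; i < n; ++i)
        y[i] += dt * decay(y[i], k[i]);
}

/* Variant B: explicit, manually unrolled by 4.
   Remainder loop omitted for brevity; assumes n is a multiple of 4. */
void step_explicit(size_t n, double *y, const double *k, double dt)
{
    size_t i;
    for (i = 0; i < n; i += 4) {
        y[i]     += dt * (-k[i]     * y[i]);
        y[i + 1] += dt * (-k[i + 1] * y[i + 1]);
        y[i + 2] += dt * (-k[i + 2] * y[i + 2]);
        y[i + 3] += dt * (-k[i + 3] * y[i + 3]);
    }
}
```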
I am aware of profiling techniques and how to apply them to specific problems, but I wonder to what extent general statements can be made.