
Let's say I have a function called from within a tight loop, that allocates a few large POD arrays (no constructors) on the stack in one scenario, vs. I allocate the arrays dynamically once and reuse them in each iteration. Do local arrays add run-time cost or not?
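
For concreteness, here's a minimal sketch of the two scenarios being compared (all names and sizes are illustrative):

```c
#include <stddef.h>

#define N (64 * 1024)   /* "large" is relative; illustrative size */

/* Scenario A: a large POD array with automatic storage, allocated per call. */
double process_local(const double *input)
{
    double scratch[N];              /* one stack-pointer adjustment */
    double sum = 0.0;
    for (size_t i = 0; i < N; ++i) {
        scratch[i] = input[i] * 2.0;
        sum += scratch[i];
    }
    return sum;
}

/* Scenario B: the array is allocated once (e.g. with malloc) outside the
   tight loop and passed in for reuse on every iteration. */
double process_reused(const double *input, double *scratch)
{
    double sum = 0.0;
    for (size_t i = 0; i < N; ++i) {
        scratch[i] = input[i] * 2.0;
        sum += scratch[i];
    }
    return sum;
}
```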

As I understand it, allocating local POD variables comes down to shifting the stack pointer, so it shouldn't matter much. However, a few things come to mind that may potentially affect the performance:

  • Checking for stack overflow - who performs these checks, and when and how often? On some systems stacks can grow automatically, but again, I know very little about this. (The limit itself can at least be queried; see the sketch after this list.)

  • Cache considerations: is the stack treated in a special way by the CPU cache, or is it no different from the rest of the data?

  • Are variadic arrays any different with respect to the above? Say, for constant-sized arrays the stack can be somehow preallocated (or the frame size pre-computed by the compiler?), whereas for variadic ones something else is involved that adds run-time cost. Again, I have no idea how this works.
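
On the first point, here's a minimal sketch of how the current stack limit can at least be queried on POSIX systems (`getrlimit` and `RLIMIT_STACK` are real POSIX APIs; the rest is illustrative):

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    /* RLIMIT_STACK reports the soft/hard limits on the main thread's stack;
       the kernel grows the stack on demand up to rl.rlim_cur. */
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
        printf("soft stack limit: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);
    return 0;
}
```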

mojuba
    Pick a language. They are different languages with very different compilers. And specify your platform – kfsone Sep 26 '16 at 22:30
  • C++ does not have variable length arrays. – kfsone Sep 26 '16 at 22:31
  • @kfsone: There's no standard syntax for them (yet), but it's usually possible to get variable-length automatic storage via `_alloca()` – Ben Voigt Sep 26 '16 at 22:37
  • @kfsone clang (possibly gcc too) allows me to easily mix features from C and C++ standards, though in some cases you should suppress compiler warnings. – mojuba Sep 26 '16 at 22:39
  • @BenVoigt: `_alloca()` is a platform-dependent C function and not standard. It also has different semantics than a VLA. – too honest for this site Sep 26 '16 at 22:41
  • @mojuba: If you compile as C++, it **is** C++! Just because they share some syntax elements does not mean they have identical semantics. Pick one language! – too honest for this site Sep 26 '16 at 22:42
  • What is a "variadic array"? This is nothing defined by the C nor the C++ standard. – too honest for this site Sep 26 '16 at 22:43
  • @Olaf, see my comment above. I do whatever my compiler allows me to, which is clang. In fact it allows me to mix C, C++ and Objective-C in one module. To my knowledge GCC can do this as well (though with a different flavour of ObjC). – mojuba Sep 26 '16 at 22:44

2 Answers


Checking for stack overflow -- typically the prologue produced by the compiler will walk through the space about to be used, with a stride corresponding to the page size. This guarantees that if the OS is standing ready to extend the stack, an access to the guard page triggers that OS logic before any access lands beyond it.
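
As a rough model (this is not actual compiler output; real compilers emit the probing inline in the prologue, e.g. via MSVC's `__chkstk` helper), the probing is equivalent to something like:

```c
#include <stddef.h>

#define PAGE_SIZE 4096   /* illustrative; the real value is target-specific */

/* Touch one byte per page of the new frame, top down, so the guard page
   is hit in order and the OS can extend the stack before any ordinary
   access lands beyond it. This function only models what the emitted
   prologue does; it is not something you would call yourself. */
static void probe_stack(volatile char *frame_top, size_t frame_size)
{
    for (size_t off = PAGE_SIZE; off <= frame_size; off += PAGE_SIZE)
        frame_top[-(ptrdiff_t)off] = 0;
}
```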

Cache -- The stack is not treated in any special way, but you're likely to get more hits because of locality to the space used for spilling registers, saving return addresses, etc., which keeps the stack hot in cache. But if your stack usage is large enough, the part that's already in cache will represent only a tiny fraction of it. Also, whatever part of the stack has been used recently by another function may be hot as well.

Variable-length arrays / arrays with runtime bound -- not really that different. The compiler will have to compute the needed size beforehand, but touching all the pages and adjusting the stack pointer won't magically become more expensive. Exception: unrolling of the loop that touches pages will be affected by the fact that the number of pages isn't constant, but this is unlikely to make any difference.
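
A minimal sketch of the difference (C99 VLAs; the fixed case has a compile-time-constant frame size, while the VLA case adjusts the stack pointer by an amount computed at run time):

```c
void fixed_size(void)
{
    int a[1024];      /* frame size is a compile-time constant */
    a[0] = 1;
}

void runtime_size(int n)  /* assumes n >= 1 */
{
    int a[n];         /* C99 VLA: size computed at run time, then a single
                         stack-pointer adjustment (plus probing if needed) */
    a[0] = 1;
}
```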

Note that there are a few platforms with dedicated separate registers used for return addresses and spilling -- on such platforms the note above about these operations making automatic storage hot in cache does not apply.

Ben Voigt
  • The cache is much more complex. There is no simple "the stack will be hot in cache". That also depends on associativity. It even becomes worse when allocating large arrays on the stack. Also, alignment can degrade performance when using the stack. Without proper profiling on modern platforms, specific assumptions about behaviour are more speculation than anything else. – too honest for this site Sep 26 '16 at 23:03

The only performance hit you'll take is if the stack memory hasn't been mapped yet. Creating the virtual-to-physical mappings can take some time, but only the first time you use the memory. Note that you'd pay the same price if, for example, you created a large POD array on the heap using `new` or `malloc()` (or any variant thereof) and the pages you got back hadn't been mapped yet either.
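
A sketch of paying that first-touch price up front for a heap allocation (`malloc` and `memset` are standard; the helper name is made up):

```c
#include <stdlib.h>
#include <string.h>

/* Allocate and immediately touch every page so the virtual-to-physical
   mappings are created here, not in the middle of the tight loop. */
double *alloc_prefaulted(size_t n)
{
    double *p = malloc(n * sizeof *p);
    if (p)
        memset(p, 0, n * sizeof *p);
    return p;
}
```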

Stack overflow "checking" typically amounts to a SIGSEGV when you overrun the stack. Whether or not the stack grows automatically depends on the OS, and in some cases on the thread using the stack in question, since a process can have more than one thread and therefore more than one stack. In general, a process's original thread has a stack that grows automatically up to some limit, while threads started by the process have a fixed-size stack. How the stack grows isn't as important as the fact that it has to grow - the growth can be a significant performance hit.
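
For example, with POSIX threads the stack of a new thread is fixed at creation time and can be sized explicitly (`pthread_attr_setstacksize` is a real API; the 8 MiB figure and the helper name are just examples):

```c
#include <pthread.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

/* Create a thread whose (non-growing) stack is 8 MiB. */
int start_worker(pthread_t *tid)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 8u * 1024 * 1024);
    int rc = pthread_create(tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    return rc;
}
```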

So it will in general be a lot faster to use stack memory instead of heap memory - as long as you pay the price up front and "touch" each page you need to use for the stack, to ensure it "exists" and has a virtual-to-physical mapping. But the trade-off is that stack memory can only be used by the thread that's running on that stack, and its size is likely limited much more than the heap's. That thread may allow access to data on its stack from other threads, but it's that thread that has to retain control of its own stack.
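
A sketch of that up-front "touch" for the stack itself (a trick real-time code sometimes uses at thread start; the sizes and names are illustrative):

```c
#include <stddef.h>

#define STACK_RESERVE (256 * 1024)   /* how much stack we expect to need */
#define PAGE_SIZE_GUESS 4096

/* Call once, early, from the thread that will run the tight loop: the
   local array forces the pages under it to be mapped now, so later deep
   frames reuse already-mapped pages. */
void prefault_stack(void)
{
    volatile char reserve[STACK_RESERVE];
    for (size_t i = 0; i < STACK_RESERVE; i += PAGE_SIZE_GUESS)
        reserve[i] = 0;   /* one write per page is enough to fault it in */
}
```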

Andrew Henle
  • "So it will in general be a lot faster to use stack memory" - pure speculation without a true foundation. It depends on the actual application. Indeed properly and reasonably allocated dynamic memory can be much faster. Without a specific use-case it is impossible to say. – too honest for this site Sep 26 '16 at 23:07
  • @Olaf "pure speculation without a true foundation" Do you really think merely bumping the stack pointer to obtain "more" stack is not going to be "in general ... a lot faster" than going through the `malloc()` or `new` call stack with its locking, locating and parceling out a suitable chunk of memory, and returning it while releasing the lock(s) obtained? Even if it's "properly and reasonably allocated", it isn't going to be faster than adding some value to the contents of the stack pointer register. "But the stack may need to grow" Well, so might the heap. – Andrew Henle Sep 27 '16 at 09:38
  • 1) The C standard does not enforce a specific algorithm. 2) This is irrelevant if you pre-alloc the blocks and e.g. use a pool. 3) If processing time on that buffer is dominant, alignment, etc., becomes more important. And a lot of other issues (RAM banks, cache thrashing, etc.). I did not explicitly refer to the allocation. Also, the stack may be in different memory, typically faster, but size-limited, etc. OP does not even specify his target arch. – too honest for this site Sep 27 '16 at 09:44
  • *The C standard does not enforce a specific algorithm.* But it **does** specify a *set of library functions*. Again, how can calling *functions* be anywhere near as fast as adding to a *register*? *This is irrelevant if you pre-alloc the blocks and e.g. use a pool.* You can preallocate stack memory, too. Search the air about you - I suspect you'll note a visible wrinkled pucker. – Andrew Henle Sep 27 '16 at 09:53
  • Interestingly, you concentrate on the one-time overhead and completely ignore the - for a very large array - more relevant other penalties. Btw. a function call need not actually call a function once compiled. – too honest for this site Sep 27 '16 at 09:59