What makes this function run much slower?

Question

I've been trying to make an experiment to see if the local variables in functions are stored on a stack.

So I wrote a little performance test:

function test(fn, times){
    var i = times;
    var t = Date.now()
    while(i--){
        fn()
    }
    return Date.now() - t;
} 
function straight(){
    var a = 1
    var b = 2
    var c = 3
    var d = 4
    var e = 5
    a = a * 5
    b = Math.pow(b, 10)
    c = Math.pow(c, 11)
    d = Math.pow(d, 12)
    e = Math.pow(e, 25)
}
function inversed(){
    var a = 1
    var b = 2
    var c = 3
    var d = 4
    var e = 5
    e = Math.pow(e, 25)
    d = Math.pow(d, 12)
    c = Math.pow(c, 11)
    b = Math.pow(b, 10)
    a = a * 5
}

I expected the inversed function to run much faster. Instead, I got a surprising result.

Whichever function I test first runs about 10 times faster until I test the second one. After that, both are slow.

Example:

> test(straight, 10000000)
30
> test(straight, 10000000)
32
> test(inversed, 10000000)
390
> test(straight, 10000000)
392
> test(inversed, 10000000)
390

Same behaviour when tested in alternative order.

> test(inversed, 10000000)
25
> test(straight, 10000000)
392
> test(inversed, 10000000)
394

I've tested it both in the Chrome browser and in Node.js, and I have absolutely no clue why this happens. The effect lasts until I refresh the current page or restart the Node REPL.

What could cause such a significant (~12 times) slowdown?

PS. Since it seems to happen only in some environments, please mention the environment you're using to test it.

Mine were:

OS: Ubuntu 14.04
Node v0.10.37
Chrome 43.0.2357.134 (Official Build) (64-bit)

/Edit
On Firefox 39 it takes ~5500 ms for each test regardless of the order. It seems to occur only on specific engines.

/Edit2
Inlining the function to the test function makes it run always the same time.
Is it possible that there is an optimization that inlines the function parameter if it's always the same function?

My guess is this has something to do with garbage collection. Garbage collection will create spikes of used memory before the collector comes by and cleans up all the remains. Does it make a difference if you switch the order of functions around? — somethinghere, Jul 29 '15 at 10:51
Try switching your tests: first run against `inversed` then against `straight`. — robertklep, Jul 29 '15 at 10:51
@robertklep Like I wrote running in alternative order yields the same results — Krzysztof Wende, Jul 29 '15 at 10:52
Yeah, something weird is going on. I'd rather expect 10x *speed-up* (because of JIT or whatever), but *slow-down*? :scratches-head: — Sergio Tulentsev, Jul 29 '15 at 11:02
Confirmed in V8 and Spidermonkey. It happens even if `inversed` and `straight` have exactly the same definition. Possibly invoking more than one function incurs extra overhead. — 1983, Jul 29 '15 at 11:05
I also suspect it must be something Garbage Collection related. But according to V8 documentation local variables are stored on a stack instead of a heap so they should be collected after the end of the function. Not randomly — Krzysztof Wende, Jul 29 '15 at 11:07
Following this question, I asked a new one here and referred back to this post: http://stackoverflow.com/questions/31698747/does-the-js-garbage-collector-clear-stack-memory Interesting question btw. — html_programmer, Jul 29 '15 at 11:13
@KimGysen. I'm not an expert in that matter. But I always thought that the heap gets searched for clearing, and the stack just gets popped of all the local variables used in the function after it ends its execution. — Krzysztof Wende, Jul 29 '15 at 11:19
@KrzysztofWende That's also what I've been reading. I didn't find any indication to assume the stack memory would get cleared by the GC, nor do I see an immediate reason why this would be useful. I'm also no expert on this, but the answer to this question could be interesting for any programmer. Might ultimately be worth a bounty. — html_programmer, Jul 29 '15 at 11:24
I added a comment [here](http://stackoverflow.com/questions/31698747/does-the-js-garbage-collector-clear-stack-memory); basically, `function(){ var foo = new SomeObject() }` allocates the object from the heap and stores a reference to it in the stack variable foo. Upon function return, the memory for foo is freed, and the reference count to the object decreased. Then GC takes care of destroying the object. — Kenney, Jul 29 '15 at 11:25
Perhaps `fn` is inlined until the point where it can take on more than one value. — 1983, Jul 29 '15 at 11:49
Let's wait for some engineer engaged in V8 to see if it's true — Krzysztof Wende, Jul 29 '15 at 11:52

Vyacheslav Egorov · Accepted Answer · 2015-07-29T13:07:34.723

Once you call test with two different functions, the fn() call site inside it becomes megamorphic and V8 is unable to inline at it.

Function calls (as opposed to method calls o.m(...)) in V8 are accompanied by a one-element inline cache instead of a true polymorphic inline cache.
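To make the idea of a one-element cache concrete, here is a toy model of a call site that can remember only one target. It is purely illustrative and is not how V8 is implemented internally; `makeCallSite`, the state names, and the string-returning stand-ins for `straight` and `inversed` are all made up for this sketch.

```javascript
// Toy model only: NOT V8's implementation, just an illustration of how a
// call site with room for a single cached target changes state.
function makeCallSite() {
    var state = 'uninitialized'; // then 'monomorphic', then 'megamorphic'
    var cachedTarget = null;

    function call(fn) {
        if (state === 'uninitialized') {
            // First call: remember the only target seen so far.
            state = 'monomorphic';
            cachedTarget = fn;
        } else if (state === 'monomorphic' && fn !== cachedTarget) {
            // A second, different target: a one-element cache has no room
            // for it, so the site stops specializing for any target.
            state = 'megamorphic';
            cachedTarget = null;
        }
        return fn();
    }

    call.state = function () { return state; };
    return call;
}

// Hypothetical stand-ins for the functions from the question.
function straight() { return 'straight'; }
function inversed() { return 'inversed'; }

var site = makeCallSite();
site(straight);
site(straight);
console.log(site.state()); // monomorphic
site(inversed);
console.log(site.state()); // megamorphic
site(straight);
console.log(site.state()); // megamorphic (the state never goes back)
```

In this model, as in the timings from the question, going back to calling only the first function does not restore the earlier state.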

Because V8 is unable to inline at the fn() call site, it is unable to apply a variety of optimizations to your code. If you look at your code in IRHydra (I uploaded compilation artifacts to a gist for your convenience), you will notice that the first optimized version of test (when it was specialized for fn = straight) has a completely empty main loop.

V8 just inlined straight and removed all the code you hoped to benchmark with the Dead Code Elimination (DCE) optimization. On an older version of V8, instead of DCE, V8 would just hoist the code out of the loop via LICM, because the code is completely loop invariant.

When straight is not inlined, V8 can't apply these optimizations, hence the performance difference. A newer version of V8 would still apply DCE to straight and inversed themselves, turning them into empty functions, so the performance difference is not that big (around 2-3x). Older V8 was not aggressive enough with DCE, and that would manifest in a bigger difference between the inlined and not-inlined cases, because the peak performance of the inlined case was solely the result of aggressive loop-invariant code motion (LICM).

On a related note, this shows why benchmarks should never be written like this: their results are of no use, because you end up measuring an empty loop.
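As a rough illustration of one way to avoid measuring an empty loop, the sketch below makes each function depend on its input and return a value, and has the harness accumulate those values into a `sink` that it returns. The `seed` parameter, the `sink` field and the returned object shape are assumptions added for this sketch and are not part of the original code. It is not a guarantee that the engine finds no other shortcut, and it does nothing about the shared fn() call site, which still sees two different targets.

```javascript
// Sketch: keep the computed values observable so the work is not trivially
// dead code, and vary the input so it is not loop invariant.
function straight(seed) {
    var a = seed + 1;
    var b = seed + 2;
    var c = seed + 3;
    var d = seed + 4;
    var e = seed + 5;
    a = a * 5;
    b = Math.pow(b, 10);
    c = Math.pow(c, 11);
    d = Math.pow(d, 12);
    e = Math.pow(e, 25);
    return a + b + c + d + e;
}

function inversed(seed) {
    var a = seed + 1;
    var b = seed + 2;
    var c = seed + 3;
    var d = seed + 4;
    var e = seed + 5;
    e = Math.pow(e, 25);
    d = Math.pow(d, 12);
    c = Math.pow(c, 11);
    b = Math.pow(b, 10);
    a = a * 5;
    return a + b + c + d + e;
}

function test(fn, times) {
    var sink = 0; // accumulates results so they are actually used
    var i = times;
    var t = Date.now();
    while (i--) {
        sink += fn(i); // the input changes on every iteration
    }
    return { ms: Date.now() - t, sink: sink };
}

console.log(test(straight, 100000));
console.log(test(inversed, 100000));
```

Timings from a sketch like this still deserve suspicion; the point is only that the loop body now produces something the program consumes.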

If you are interested in polymorphism and its implications in V8 check out my post "What's up with monomorphism" (section "Not all caches are the same" talks about the caches associated with function calls). I also recommend reading through one of my talks about dangers of microbenchmarking, e.g. most recent "Benchmarking JS" talk from GOTO Chicago 2015 (video) - it might help you to avoid common pitfalls.

"Megamorphic" functions not being inlined is an insight I was not aware of. Thank you for making me more aware of that class of optimization in the engine (including "monomorphic" types). When the function becomes megamorphic, and can no longer apply optimizations, does that mean there is no inline cache to place in the global cache and there is a miss when the lookup occurs? — Travis J, Aug 04 '15 at 20:58
I am not sure I understand the question. Functions can't become megamorphic - it's a property of each individual operation inside a function, like property access `o.x` or function call `f()`. When a *function call* becomes megamorphic V8 simply *can't inline the call* - that's all there is to that. — Vyacheslav Egorov, Aug 05 '15 at 15:36

Luaan · Answer 2 · 2015-07-29T11:28:12.457

You're misunderstanding the stack.

While the "real" stack indeed only has the Push and Pop operations, this doesn't really apply for the kind of stack used for execution. Apart from Push and Pop, you can also access any variable at random, as long as you have its address. This means that the order of locals doesn't matter, even if the compiler doesn't reorder it for you. In pseudo-assembly, you seem to think that

var x = 1;
var y = 2;

x = x + 1;
y = y + 1;

translates to something like

push 1 ; x
push 2 ; y

; get y and save it
pop tmp
; get x and put it in the accumulator
pop a
; add 1 to the accumulator
add a, 1
; store the accumulator back in x
push a
; restore y
push tmp
; ... and add 1 to y

In truth, the real code is more like this:

push 1 ; x
push 2 ; y

add [bp], 1
add [bp+4], 1

If the thread stack really was a real, strict stack, this would be impossible, true. In that case, the order of operations and locals would matter much more than it does now. Instead, by allowing random access to values on the stack, you save a lot of work for both the compilers, and the CPU.
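The same idea can be sketched in JavaScript as a toy model: an array that grows and shrinks only at its end, combined with indexed access relative to a base pointer. The names `memory`, `bp`, `enterFrame` and `leaveFrame` are invented for this illustration, slots are counted in array indices rather than bytes, and nothing here describes how a real engine lays out its frames.

```javascript
// Toy model: allocation and deallocation happen only at the top of the
// stack, but any slot of the current frame can be read or written directly.
var memory = []; // the "stack" region
var bp = 0;      // base pointer: index where the current frame starts

function enterFrame() {
    var savedBp = bp;
    bp = memory.length; // the new frame starts at the current top
    return savedBp;
}

function leaveFrame(savedBp) {
    memory.length = bp; // drop all locals of the frame at once
    bp = savedBp;
}

var saved = enterFrame();
memory.push(1); // x lives at bp + 0
memory.push(2); // y lives at bp + 1

// Random access: update the locals in place, in any order, without popping.
memory[bp + 1] += 1; // y = y + 1
memory[bp + 0] += 1; // x = x + 1

console.log(memory[bp + 0], memory[bp + 1]); // 2 3
leaveFrame(saved);
console.log(memory.length); // 0
```

Swapping the two update lines changes nothing about the cost in this model, which is why the order of locals in the original test is not expected to matter.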

To answer your actual question, I suspect neither of the functions actually does anything. You're only ever modifying locals, and your functions aren't returning anything - it's perfectly legal for the compiler to completely drop the function bodies, and possibly even the function calls. If that's indeed so, whatever performance difference you're observing is probably just a measurement artifact, or something related to the inherent costs of calling a function / iterating.

edited Jul 29 '15 at 11:28

answered Jul 29 '15 at 11:22

This is more an answer to why my test was stupid in the first place and not really answer to the actual question, but thanks for clarification anyway ;) – Krzysztof Wende Jul 29 '15 at 11:23
@KrzysztofWende Yeah, I know I'm skirting the question here, but I thought it would help explain why your test is essentially irrelevant. It would be awful if you *really* only had `push` and `pop` - noöne would use the stack, really. – Luaan Jul 29 '15 at 11:25
How does it look then from data structure perspective? If it allows random order than it's not really a stack. Are these just random memory pointers floating around where they fit? If yes why is it called a stack after all? – Krzysztof Wende Jul 29 '15 at 11:28
@KrzysztofWende It's still useful - the basic semantics are still there. When you allocate a new local, you `push`. When you leave the scope, you `pop` *all* the locals of that scope. When you call a method, you `push` the return address and the arguments. When you return, you `pop` the return address and the arguments. However, for performance (and simplicity) reasons, you can address any of those arguments (for example) separately at any time in the scope where they are valid. It's also possible to pass "reference" to a local in a higher scope, so that callees can modify caller's locals. – Luaan Jul 29 '15 at 11:31
@KrzysztofWende So yes, it still *is* a stack - but it's a stack that also allows random access. It doesn't allow random allocations or deallocations - that's still exclusively about `push` and `pop` (and to be fair, operations like "trim", when you discard multiple stack frames at once). Think of it as an enhanced stack if you will. – Luaan Jul 29 '15 at 11:34
@KrzysztofWende : It's more a stack of "frames", each containing where to return to, incoming arguments, and local variables. It's not uncommon (in some languages, I'm not certain in JavaScript) for arguments to be at "known" negative offsets from the function's stack frame pointer and its locals to be at positive offsets. So it's more like a stack of objects and the type/layout of the object on top of the stack is in one-to-one correspondence with the function that is executing. – Eric Towers Jul 29 '15 at 23:42
Another important point to remember is this: In the context of C and other low-level languages, "the stack" and "the heap" are ultimately "just" regions of memory. The terms stack and heap are intended to describe how that memory is managed (allocated, deallocated), not how it is accessed. Some HLLs compile to bytecode which really does use a stack to store some information, but this is typically only used for short-term things like evaluating a single arithmetic expression. – Kevin Jul 30 '15 at 01:35

Bergi · Answer 3 · 2015-07-29T13:03:18.947

Inlining the function to the test function makes it run always the same time.
Is it possible that there is an optimization that inlines the function parameter if it's always the same function?

Yes, this seems to be exactly what you are observing. As already mentioned by @Luaan, the compiler likely drops the bodies of your straight and inversed functions anyway because they have no side effects and only manipulate some local variables.

When you call test(…, 10000000) for the first time, the optimising compiler realises after some iterations that the fn() being called is always the same, and inlines it, avoiding the costly function call. All that it does now is decrement a variable 10 million times and test it against 0.

But when you then call test with a different fn, it has to de-optimise. It may later do some other optimisations again, but now that it knows there are two different functions to be called, it cannot inline them any more.

Since the only thing you're really measuring is the function call, that leads to the large differences in your results.

An experiment to see if the local variables in functions are stored on a stack

Regarding your actual question, no, single variables are not stored on a stack (stack machine), but in registers (register machine). It doesn't matter in which order they are declared or used in your function.

Yet, they are stored on the stack, as part of so-called "stack frames". You'll have one frame per function call, storing the variables of its execution context. In your case, the stack might look like this:

[straight: a, b, c, d, e]
[test: fn, times, i, t]
…

How variables are stored is a highly implementation-specific question, especially once the function in question is optimized - as local variables might simply disintegrate into nothing. — Vyacheslav Egorov, Jul 29 '15 at 13:09
@VyacheslavEgorov: Yeah, I didn't mean that literally. Only conceptually they are part of the lexical environment which is part of the stack frame, where they are stored in an actual implementation is a completely different thing. — Bergi, Jul 29 '15 at 13:28
