13

I was going through Eloquent JavaScript (again) and came across exercise "Chess Board" of Chapter 2. I had my one decent version of solution written back in the day when I was first time reading it, and another version of solution provided at the Elequent Javascript website. I'm one of the newbies that wanna be super-efficient programmers with only one question in their head: "Can I make it work a tad faster or smaller in anyway?"

So, during my search on the web few months ago, I came across a question on Stack Overflow, regarding for loop vs while loop on basis of performance. Since in that thread it was mentioned for loops are slower than while and loops with decrementing iterator are faster so I rewrote the code for better performance.

Here's the new version with for replaced with while and conditions edited for decrementing:

console.time("looping");
var gridSize = 5000, str = '', i = gridSize, j;
while (i--) {
  j = gridSize;
  while (j--) {
    if ((i - j) % 2 === 0)
      str += " ";
    else
      str += "#";
  }
  str += "\n";
}

//console.log(str);
console.timeEnd("looping");

But to my surprise this code had even worse performance. Then, after a while I found out that if ((i - j) % 2 === 0) was the main culprit, replacing minus sign with plus reduced the total time of execution to ~ 750ms

//Checked on NODE.js = v6.11.2
Book version of code         -->    893.76 ms
while loop with subtraction  -->    1562.43 ms
while loop with addition     -->    749.62 ms


//firefox Benchmark v54.0 (64-bit) (OS Ubuntu 16.04)
Book version of code         -->    725.10 ms
while loop with subtraction  -->    1565.29 ms
while loop with addition     -->    601.12 ms

Why is subtraction having such huge impact on total execution time?

Edit 01 6:30 AM (GMT) 8th August

After looking at @jaromandaX answer I'm Pretty sure that it is not the subtracting thats slowing down this loop, its the modulo of negative Number. Again I wanna know what makes modulo of negative number slower

Syed Huzaifa Hassan
  • 776
  • 1
  • 6
  • 22
  • must be a V8 thing, I'm guessing you tried in nodejs as well as Chrome browser - because firefox doesn't show any difference - although, it's far far slower!! - i.e. on my PC it's 10 times slower than Chrome!! – Jaromanda X Aug 08 '17 at 04:48
  • @jaromanda-x its not about V8 or spiderMonkey, they both are state of the art, one supersedes other on certain things, but this major difference in performance is just crazy, 50-60 ms is acceptable but its more than double, and I checked it on mozilla as well uploading that result too. – Syed Huzaifa Hassan Aug 08 '17 at 04:59
  • 1
    It would be interesting to see what the XOR/AND combination yields: `if ((i ^ j) & 1 === 0)` – SBS Aug 08 '17 at 05:07
  • Can you please decide on which code you want in the question :p – Jaromanda X Aug 08 '17 at 05:44
  • I am sorry @JaromandaX – Syed Huzaifa Hassan Aug 08 '17 at 05:45
  • No problems - I think I found the source of the slowdown - see my answer – Jaromanda X Aug 08 '17 at 05:46
  • Seeing it right now. @JaromandaX – Syed Huzaifa Hassan Aug 08 '17 at 05:55
  • I take back what I said about it being V8-centric - firefox seems to suffer the same difference, though not as pronounced when only half the values worked on are < 0 ... it's really strange actually – Jaromanda X Aug 08 '17 at 05:56
  • It's not a problem, on other hand I am gonna wait for someone with indepth knowledge v-8 or spiderMonkey so that he could tell me the reason behind why negative number are harder to solve. BTW your answer satisfies my question. Just gonna wait few hours if that ok. just need to know more indepth. @JaromandaX – Syed Huzaifa Hassan Aug 08 '17 at 06:05
  • It might be that the real culprit is the modulus operator. Since + and - are among the most trivial operations a processor offers, I can't believe that they can cause such a performance difference. Rather I think that the + operator triggers an optimization (i.e. replace %2 by &1), while the - operator doesn't. Therefore it would be great to see benchmarks for all operator combinations. – SBS Aug 08 '17 at 06:07
  • @SBS well, culprit is modulus operator, as it takes rather long time to evaluate negative numbers. dont know why. can any one explain?? – Syed Huzaifa Hassan Aug 08 '17 at 06:12
  • As I stated above, it might be a side effect of internal optimizations triggered differently by the involved operators. – SBS Aug 08 '17 at 06:15
  • @SyedHuzaifaHassan By the way... you wrote in your question that you're generally interested in optimization. You probably can gain additional performance by avoiding string concatenation += inside the loop. This calls a string constructor on every iteration, which is a costly operation. It's more effective to store the characters in a preallocated array and construct the string from it in the end all at once. I've used this trick successfully in the past, and it had a huge impact on performance. – SBS Aug 08 '17 at 06:32
  • @SBS Thanks Man for telling me this, on side note, I'm interested in performance optimization, but still expanding my knowledge thanks to you guys. I need to know more about optimization, can you point me to some good resources?? – Syed Huzaifa Hassan Aug 08 '17 at 06:39
  • Is this even valid now? Even running this here on SO doesn't show any change in running time – thekosmix Dec 16 '20 at 08:39

5 Answers5

13

This is far from a full answer and requires further investigation (or insights from someone who knows details of V8 implementation). Still, here are my findings:

Sidenode: results were gathered running Node.JS using following command line:

node --expose-gc --print-code --code-comments --print-opt-code --trace-hydrogen --redirect-code-traces --redirect-code-traces-to=code.asm --trace_representation --trace-deopt --trace-opt 1.js

and a bit of looking into the V8 source code.

1. Performance difference comes from the fact that different machine code is generated in those case. For + the code for % is

                  ;;; <@134,#123> add-i
00000151A32DD74B   395  03c2           addl rax,rdx
00000151A32DD74D   397  0f807a030000   jo 1293  (00000151A32DDACD)
                  ;;; <@136,#126> mod-by-power-of-2-i
00000151A32DD753   403  85c0           testl rax,rax
00000151A32DD755   405  790f           jns 422  (00000151A32DD766)
00000151A32DD757   407  f7d8           negl rax
00000151A32DD759   409  83e001         andl rax,0x1
00000151A32DD75C   412  f7d8           negl rax
00000151A32DD75E   414  0f846e030000   jz 1298  (00000151A32DDAD2)
00000151A32DD764   420  eb03           jmp 425  (00000151A32DD769)
00000151A32DD766   422  83e001         andl rax,0x1
                  ;;; <@138,#200> smi-tag
00000151A32DD769   425  8bd8           movl rbx,rax
00000151A32DD76B   427  48c1e320       REX.W shlq rbx, 32
                  ;;; <@140,#130> gap
00000151A32DD76F   431  488bc2         REX.W movq rax,rdx

while for - the code is much more complicated:

                  ;;; <@136,#123> sub-i
00000151A32E57E1   417  412bc3         subl rax,r11
00000151A32E57E4   420  0f8039040000   jo 1507  (00000151A32E5C23)
                  ;;; <@138,#200> int32-to-double
00000151A32E57EA   426  c5f957c0       vxorpd xmm0,xmm0,xmm0
00000151A32E57EE   430  c5fb2ac0       vcvtlsi2sd xmm0,xmm0,rax
                  ;;; <@139,#200> gap
00000151A32E57F2   434  c5f928ca       vmovapd xmm1,xmm2
                  ;;; <@140,#126> mod-d
00000151A32E57F6   438  4989e2         REX.W movq r10,rsp
00000151A32E57F9   441  4883ec28       REX.W subq rsp,0x28
00000151A32E57FD   445  4883e4f0       REX.W andq rsp,0xf0
00000151A32E5801   449  4c89542420     REX.W movq [rsp+0x20],r10
00000151A32E5806   454  48b830d4124001000000 REX.W movq rax,000000014012D430
00000151A32E5810   464  ffd0           call rax
00000151A32E5812   466  488b642420     REX.W movq rsp,[rsp+0x20]
                  ;;; <@142,#126> lazy-bailout
                  ;;; <@144,#202> number-tag-d
00000151A32E5817   471  498b9dc06f0400 REX.W movq rbx,[r13+0x46fc0]
00000151A32E581E   478  488bc3         REX.W movq rax,rbx
00000151A32E5821   481  4883c010       REX.W addq rax,0x10
00000151A32E5825   485  493b85c86f0400 REX.W cmpq rax,[r13+0x46fc8]
00000151A32E582C   492  0f878f030000   ja 1409  (00000151A32E5BC1)
00000151A32E5832   498  498985c06f0400 REX.W movq [r13+0x46fc0],rax
00000151A32E5839   505  48ffc3         REX.W incq rbx
00000151A32E583C   508  4d8b5560       REX.W movq r10,[r13+0x60]
00000151A32E5840   512  4c8953ff       REX.W movq [rbx-0x1],r10
00000151A32E5844   516  c5fb114307     vmovsd [rbx+0x7],xmm0
                  ;;; <@146,#130> gap
00000151A32E5849   521  488b45a0       REX.W movq rax,[rbp-0x60]
00000151A32E584D   525  488b7db8       REX.W movq rdi,[rbp-0x48]
00000151A32E5851   529  488b75c0       REX.W movq rsi,[rbp-0x40]
00000151A32E5855   533  488b4dc8       REX.W movq rcx,[rbp-0x38]
00000151A32E5859   537  488b55b0       REX.W movq rdx,[rbp-0x50]
00000151A32E585D   541  4c8b4da8       REX.W movq r9,[rbp-0x58]
00000151A32E5861   545  4c8b4598       REX.W movq r8,[rbp-0x68]
00000151A32E5865   549  c5fb109578ffffff vmovsd xmm2,[rbp-0x88]

In short the difference is that for the "plus" case Mod (%) operation is performed using highly specialized mod-by-power-of-2-i machine code but for the "minus" case it is done using mod-d which is floating point-based arithmetic implementation.

Notice also that mod-by-power-of-2-i machine code does support negative values. It can be roughly re-written as something like this:

if (rax < 0) {
    rax = -rax;
    rax = rax & 1;
    rax = -rax;
}
else {
    rax = rax & 1;
}

So this is not a case of optimized machine code only for positive values.

2. The difference in generated code seem to come from the fact that type inference works differently. Logs produced by --trace_representation show the following difference between "plus" and "minus" cases for a simplified program:

var f_minus = function(log) {
    var str = '', i = gridSize, j;
    while (i--) {
      j = gridSize;
      while (j--) {
        var ttt = (i - j) % 2
      }
    }

  if(log) {
     if(ttt == -1)
        console.log(t);
   }
}


var f_plus = function(log) {
    var str = '', i = gridSize, j;
    while (i--) {
      j = gridSize;
      while (j--) {
        var ttt = (i + j) % 2
      }
    }

    if(log){
     if(ttt == -1)
        console.log(t);
    }
}

Compare

[marking 00000025D4303E91 <JS Function f_minus (SharedFunctionInfo 00000278933F61C1)> for optimized recompilation, reason: small function, ICs with typeinfo: 8/12 (66%), generic ICs: 0/12 (0%)]
[compiling method 00000025D4303E91 <JS Function f_minus (SharedFunctionInfo 00000278933F61C1)> using Crankshaft OSR]
#37 Phi is used by real #110 Branch as v
#38 Phi is used by real #58 Add as t
#38 Phi is used by real #69 StackCheck as v
#38 Phi is used by real #70 LoadContextSlot as v
#38 Phi is used by real #122 CompareGeneric as t
#38 Phi is used by real #132 LoadGlobalGeneric as v
#38 Phi is used by real #134 LoadNamedGeneric as v
#38 Phi is used by real #136 LoadGlobalGeneric as v
#38 Phi is used by real #141 CallWithDescriptor as v
#38 Phi is used by real #156 Return as v
#38 Phi is used by real #101 Mod as t
#38 Phi is used by real #98 Sub as t
#38 Phi is used by real #95 StackCheck as v
#38 Phi is used by real #84 Add as t
#47 Phi is used by real #56 ForceRepresentation as s
#49 Phi is used by real #122 CompareGeneric as t
#77 Phi is used by real #83 ForceRepresentation as s
generalizing use representation 'v' of #40 Phi with uses of #47 Phi 's'
generalizing use representation 'v' of #42 Phi with uses of #49 Phi 't'
generalizing use representation 't' of #42 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #49 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #78 Phi with uses of #49 Phi 't'
Changing #101 Mod representation v -> i based on inputs
Changing #101 Mod representation i -> d based on output
Changing #98 Sub representation v -> s based on inputs
Changing #98 Sub representation s -> i based on use requirements
Changing #84 Add representation v -> i based on inputs
...

to this

[marking 000002C17CAAB341 <JS Function f_plus (SharedFunctionInfo 00000278933F6289)> for optimized recompilation, reason: small function, ICs with typeinfo: 8/12 (66%), generic ICs: 0/12 (0%)]
[compiling method 000002C17CAAB341 <JS Function f_plus (SharedFunctionInfo 00000278933F6289)> using Crankshaft OSR]
#37 Phi is used by real #110 Branch as v
#38 Phi is used by real #58 Add as t
#38 Phi is used by real #69 StackCheck as v
#38 Phi is used by real #70 LoadContextSlot as v
#38 Phi is used by real #122 CompareGeneric as t
#38 Phi is used by real #132 LoadGlobalGeneric as v
#38 Phi is used by real #134 LoadNamedGeneric as v
#38 Phi is used by real #136 LoadGlobalGeneric as v
#38 Phi is used by real #141 CallWithDescriptor as v
#38 Phi is used by real #156 Return as v
#38 Phi is used by real #101 Mod as t
#38 Phi is used by real #98 Add as t
#38 Phi is used by real #95 StackCheck as v
#38 Phi is used by real #84 Add as t
#47 Phi is used by real #56 ForceRepresentation as s
#49 Phi is used by real #122 CompareGeneric as t
#77 Phi is used by real #83 ForceRepresentation as s
generalizing use representation 'v' of #40 Phi with uses of #47 Phi 's'
generalizing use representation 'v' of #42 Phi with uses of #49 Phi 't'
generalizing use representation 't' of #42 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #49 Phi with uses of #78 Phi 'v'
generalizing use representation 'v' of #78 Phi with uses of #49 Phi 't'
Changing #101 Mod representation v -> i based on inputs
Changing #98 Add representation v -> s based on inputs
Changing #98 Add representation s -> i based on use requirements
Changing #84 Add representation v -> i based on inputs
...

The interesting difference is the line

Changing #101 Mod representation i -> d based on output

that is only present in the f_minus but not the f_plus case. For some reason the compiler believes that in the f_minus case type should be Double instead of Integer basing the guess on the output value. Interestingly, if I change the line

        var ttt = (i - j) % 2

to

        var ttt = (i - j + gridSize) % 2; 

it again starts generating fast mod-by-power-of-2-i code. So yes, it looks probable that the output value affects optimizing compiler. But it is not clear why this happens in this particular case.

At the first glance this behavior looks like a bug in the optimizing compiler that fails to notice that the "minus" case can be handled by mod-by-power-of-2-i as well. I was not able to trace why this happens just glancing over the source code so further input is welcome.

SergGr
  • 23,570
  • 2
  • 30
  • 51
  • Great answer! However, it seems that `(-i-j)%3` instead of `(+i+j)%3` leads to comparable performance degradation. I guess, the efficiency of specialized `mod-by-power-of-2-i` is only half the battle. Maybe the necessity of preserving the divident sign (`-1 % 2` is `-1`, you know) forces floating-point based implementations or double `rax = -rax;` (which has its price)... – Stanislav Kralin Aug 17 '17 at 19:55
  • @StanislavKralin, the crucial difference is obviously not in `mod-by-power-of-2-i` which gives some additional benefit. The important difference is integer vs floating point arithmetic. For `% 3` in case of `+` `mod-by-const-i` code is generated which is a bit less efficient than `mod-by-power-of-2-i` but still much faster than `mod-d`. And this (integer vs float) is what I point to in the second half of my (unfinished) investigation. – SergGr Aug 17 '17 at 20:40
  • @StanislavKralin, as for preserving sign I explicitly mentioned in my answer that `mod-by-power-of-2-i` can handle negative values (see paragraphs just above **2.**). So the issue is not with that code but with some higher level logic that guess wrongly at the type-inference stage. – SergGr Aug 17 '17 at 20:54
  • thanks for your answer it explains the answer well, exactly what I was hoping for – Syed Huzaifa Hassan Aug 17 '17 at 23:09
5

instead of using the costly modular operation in

((i - j) % 2 === 0)

you can use bitwise operations

(((i-j)&1) === 0)

As suggested by SBS in the comments you should also try

(((i^j)&1) === 0)
marvel308
  • 10,288
  • 1
  • 21
  • 32
4

My tests (average of 5 runs each) in nodejs show

(i - j) % 2 // 1170ms
(i + j) % 2 //  720ms
Math.abs(i - j) % 2 // 720ms
Math.abs(i + j) % 2 // 720ms
(gridSize + i + j) % 2 // 715ms
(gridSize + i - j) % 2 // 710ms
(-i - j) % 2 // 1500ms

Some oddness there, the big surprise is calling Math.abs has virtually zero effect on the i + j case, but even more surprising is that adding gridSize makes the gridSize + i - j case the fastest!!

But what I can take from it is that the main issue is in

(i - j) % 2

Many (i - j) are < 0 (half of them?)

With (-i - j) ALL values are < 0

Conclusion: When faced with a modulo operation on a negative number, performance decreases significantly

note, you should be able to use

console.time("looping");
...
console.timeEnd("looping");

in your browser as well, so you can run the same code without needing to use performance.now() in the browser

Not sure how cromulent this "benchmark" is but

console.time("positive");
(function() {
    var size = 100000;
    var v = 0;
    while (size--) {
        v+=(+size)%2
    }
})();
console.timeEnd("positive");

console.time("negative");
(function() {
    var size = 100000;
    var v = 0;
    while (size--) {
        v+=(-size)%2
    }
})();
console.timeEnd("negative");
Jaromanda X
  • 53,868
  • 5
  • 73
  • 87
3

Your main problem is nested loops. Try to avoid them:

console.time("looping");
var gridSize = 5000;
var board = "";
var firstLine = "";
var secondLine = "";

var counter = 0;
var isBlack = true;

while (counter < gridSize) {
    firstLine += isBlack ? "#" : " ";
    secondLine += isBlack ? " " : "#";
    isBlack = !isBlack ;
    counter++ ;
}

var counter = 0;
var isBlack = false;
while (counter < gridSize) {
    board =  board + (isBlack ? firstLine : secondLine) + "\n";
    isBlack = !isBlack;
    counter++;
}
// console.log(board);
console.timeEnd("looping");

There exists more efficient algorithm to fill large string with short repeatable pattern. However, the code will be too complicated. I hope, this algorithm is used in the padStart() implementations.

console.time("looping");
var gridSize = 5000;
var firstLine = "".padStart(gridSize, " #");
var secondLine = "".padStart(gridSize, "# ");
var board = "".padStart(gridSize * (gridSize + 1), firstLine + "\n" + secondLine + "\n");
// console.log(board);
console.timeEnd("looping");

Results on my machine (Windows 7, Firefox 54.0 64-bit):

+------------------------+---------+
|         Method         |  Time   |
+------------------------+---------+
| nested loops, (-i-j)%2 | 1500 ms |
| nested loops,  (i+j)%2 |  500 ms |
| nested loops,  (i^j)&1 |  500 ms |
| manual filling         |    2 ms |
| padStart()             |    1 ms |
+------------------------+---------+

By the way, FIDE rule says:

The chessboard is placed between the players in such a way that the near corner square to the right of the player is white.

Stanislav Kralin
  • 11,070
  • 4
  • 35
  • 58
2

The most time consuming operation in the snippet is string concat. So I decided to remove that and use an actual array instead. Below is 2 implementations, one from the book, other is from your code. Book version is twice fast on my machine. Also, making that change speed up the code about ~6 times. Maybe array resize also has an effect here and we should test a version without arrays as well.

My conclusion: Problem may not be actually the negative modulus, but some other parts could be affecting the benchmarks.

console.time("looping");

var board = [];

var size = 5000;

for (var y = 0; y < size; y++) {
  board[y]=[];
  for (var x = 0; x < size; x++) {
    if ((x + y) % 2 === 0)
       board[y][x] = " ";
    else
       board[y][x] = "#";
   }
    
}

 
console.timeEnd("looping");
   

    console.time("looping2");
    var gridSize = 5000, str = [], i = gridSize, j;
    while (i--) {
      str[i]=[];
      j = gridSize;
      while (j--) {
        if ((i - j) % 2 === 0)
          str[i][j] = " ";
        else
          str[i][j] = "#";
      }
    }

     console.timeEnd("looping2");
xycf7
  • 913
  • 5
  • 10