20

If I give Closure Compiler something like this:

window.array = '0123456789'.split('');

It "compiles" it to this:

window.array="0,1,2,3,4,5,6,7,8,9".split(",");

Now as you can tell, that's bigger. Is there any reason why Closure Compiler is doing this?

qwertymk
  • 34,200
  • 28
  • 121
  • 184
  • 1
    Nope, there's no reason as far as I know, but nothing's perfect at minification... – Ry- Apr 18 '12 at 13:33
  • 7
    Maybe because it is faster (at least in Chrome, even if not by much): http://jsperf.com/empty-split-vs-comma-split Would be interesting to see the difference in other browsers... *edit:* In Safari and Firefox it appears to be slower though... then I don't know. But in general, CC favours speed over size in edge cases. – Felix Kling Apr 18 '12 at 13:33
  • Maybe for compatibility? – SLaks Apr 18 '12 at 13:35
  • @Matt: That's why I said _maybe_. But that's the kind of thing that I might expect IE6 to mess up. – SLaks Apr 18 '12 at 13:43
  • My initial thought was the same as @SLaks but I am almost certain that it will work everywhere. I have tested just it in IE6 and it has no issues that I can find. – James Allardice Apr 18 '12 at 14:57
  • 2
    Just tested - it doesn't work in my IE4. – Florian Margaine Nov 20 '12 at 13:31

3 Answers3

21

I think this is what's going on, but I am by no means certain...

The code that causes the insertion of commas is tryMinimizeStringArrayLiteral in PeepholeSubstituteAlternateSyntax.java.

That method contains a list of characters that are likely to have a low Huffman encoding, and are therefore preferable to split on than other characters. You can see the result of this if you try something like this:

"a b c d e f g".split(" "); //Uncompiled, split on spaces
"a,b,c,d,e,f,g".split(","); //Compiled, split on commas (same size)

The compiler will replace the character you try to split on with one it thinks is favourable. It does so by iterating over the characters of the string and finding the most favourable splitting character that does not occur within the string:

// These delimiters are chars that appears a lot in the program therefore
// probably have a small Huffman encoding.
NEXT_DELIMITER: for (char delimiter : new char[]{',', ' ', ';', '{', '}'}) {
  for (String cur : strings) {
    if (cur.indexOf(delimiter) != -1) {
      continue NEXT_DELIMITER;
    }
  }
  String template = Joiner.on(delimiter).join(strings);
  //...
}

In the above snippet you can see the array of characters the compiler claims to be optimal to split on. The comma is first (which is why in my space example above, the spaces have been replaced by commas).

I believe the insertion of commas in the case where the string to split on is the empty string may simply be an oversight. There does not appear to be any special treatment of this case, so it's treated like any other split call and each character is joined with the first appropriate character from the array shown in the above snippet.


Another example of how the compiler deals with the split method:

"a,;b;c;d;e;f;g".split(";"); //Uncompiled, split on semi-colons
"a, b c d e f g".split(" "); //Compiled, split on spaces

This time, since the original string already contains a comma (and we don't want to split on the comma character), the comma can't be chosen from the array of low-Huffman-encoded characters, so the next best choice is selected (the space).


Update

Following some further research into this, it is definitely not a bug. This behaviour is actually by design, and in my opinion it's a very clever little optimisation, when you bear in mind that the Closure compiler tends to favour the speed of the compiled code over size.

Above I mentioned Huffman encoding a couple of times. The Huffman coding algorithm, explained very simply, assigns a weight to each character appearing the the text to be encoded. The weight is based on the frequency with which each character appears. These frequencies are used to build a binary tree, with the most common character at the root. That means the most common characters are quicker to decode, since they are closer to the root of the tree.

And since the Huffman algorithm is a large part of the DEFLATE algorithm used by gzip. So if your web server is configured to use gzip, your users will be benefiting from this clever optimisation.

James Allardice
  • 164,175
  • 21
  • 332
  • 312
  • @Bergi - I am inclined to agree. However, I think we would need to see more performance tests in different browsers comparing splitting on the empty string and the characters in the array of low Huffman encoded characters before we can say for sure. – James Allardice Apr 18 '12 at 14:48
  • Speaking of that, splitting vs. regular assignment is already very slow: [jsperf](http://jsperf.com/array-assign-vs-split). – Oleg V. Volkov Apr 18 '12 at 14:54
  • @OlegV.Volkov - Yes, and the Closure compiler will convert a call to `split` into a direct assignment if it results in shorter code. – James Allardice Apr 18 '12 at 14:55
  • @Bergi: Seems to be a **major** bug as every `"a b c d e f g".split(" ");` can be `"abcdefg".split("");` and save almost half the bytes. – qwertymk Apr 18 '12 at 21:46
  • @FelixKling - Thanks! Can you (or anyone else) think of any situation that would cause `.split("")` to fail or work in an unexpected way? I don't think there are any such situations (I've tested it in a lot of browsers and found no issues at all), so it's probably worth filing a bug report. It just seems like quite a big oversight so I'm surprised it hasn't come up before, which makes me think it may be by design. – James Allardice Apr 18 '12 at 22:07
  • Ok, it has come up before (I'm surprised I didn't find that issue before) - see the link posted by thg435 in the comments on their answer. It's not worth filing a bug report. – James Allardice Apr 18 '12 at 22:10
  • @James Allardice You give the compiler a little too much credit. It transforms the constant split into an array, in the hopes that it can transform the array into something simpler. This was put in to support the "remove unused parts of jQuery in ADVANCED mode" effort. jQuery uses this pattern in someplaces to define methods. qwertymk if this is significant to you it would a fairly straightforward patch. – John Apr 19 '12 at 15:30
5

This issue was fixed on Apr 20, 2012 see revision: https://code.google.com/p/closure-compiler/source/detail?r=1267364f742588a835d78808d0eef8c9f8ba8161

John
  • 5,443
  • 15
  • 21
4

Ironically, split in the compiled code has nothing to do with split in the source. Consider:

Source  : a = ["0","1","2","3","4","5"]
Compiled: a="0,1,2,3,4,5".split(",")

Here, split is just a way to represent long arrays (long enough for sum of all quotes + commas to be longer than split(","") ). So, what's going on in your example? First, the compiler sees a string function applied to a constant and evaluates it right away:

'0123456789'.split('') => ["0","1","2","3","4","5","6","7","8","9"]

At some later point, when generating output, the compiler considers this array to be "long" and writes it in the above "split" form:

["0","1","2","3","4","5","6","7","8","9"] => "0,1,2,3,4,5,6,7,8,9".split(",")

Note that all information about split('') in the source is already lost at this point.

If the source string were shorter, it would be generated in the array array form, without extra splitting:

Source  : a = '0123'.split('')
Compiled: a=["0","1","2","3"]
georg
  • 211,518
  • 52
  • 313
  • 390
  • So then shouldn't `['0', '1', '2', '3', '4', '5', '6', '7']` always compile to `'01234567'.split('')` ? – qwertymk Apr 18 '12 at 21:47
  • @qwertymk - That's exactly what it does do isn't it? It will compile any direct string array assignment to a `split` call, as long as the number of characters required is less. Have a look at the implementation of the method mentioned in my answer to see exactly what goes on. – James Allardice Apr 18 '12 at 21:51
  • @JamesAllardice: I mean without spaces and empty delimiter. – qwertymk Apr 18 '12 at 21:52
  • @qwertymk - Oh yeah sorry, I misread your comment. Yes, that probably is what it should compile to. I think you may have come across quite a big oversight by the Closure compiler team! – James Allardice Apr 18 '12 at 21:53
  • 3
    @qwertymk: yes, they could check if all strings in array are of length 1 and provide an optimization for this, however, to quote one of the maintainers, [the single letter example... seems like a special case and I would have to question whether it's worth the effort to address](http://code.google.com/p/closure-compiler/issues/detail?id=702) – georg Apr 18 '12 at 22:05
  • @thg435 - Good find. I looked at the issues but never noticed that one. – James Allardice Apr 18 '12 at 22:11