364

I was baffled when a colleague showed me this line of JavaScript alerting 42.

alert(2+ 40);

It quickly turned out that what looks like a minus sign is actually an arcane Unicode character with clearly different semantics.

This left me wondering why that character doesn't produce a syntax error when the expression is parsed. I'd also like to know if there are more characters behaving like this.

GOTO 0
  • Did you try to run this from the console? I get -38 as result. Not sure why. – Ely Jul 19 '15 at 23:53
  • Maybe you two are running different browsers, might be browser dependent. – Lansana Camara Jul 19 '15 at 23:54
  • It's the minus sign I just found out. – Ely Jul 19 '15 at 23:54
  • 28
    @Elyasin Did you copy/paste or retype? – user253751 Jul 20 '15 at 01:48
  • I copy and pasted yours. When I typed with my keyboard it was ok. – Ely Jul 20 '15 at 08:16
  • 4
    This works in Visual C# as well. When pasting the strange character into the Visual Studio IDE, or when completing the statement by typing `;`, the editor tends to change the strange ` ` character into a normal space, but if you undo that "auto-correction", you have the same behavior. That character has the same semantics as a space, even if it looks like a hyphen or minus (in usual fonts). – Jeppe Stig Nielsen Jul 20 '15 at 08:30
  • 4
    The opposite can happen as well. Some languages supporting unicode in identifiers accept unicode characters that look like white space (in other words, you can't see them); it may even be possible to have completely invisible identifiers. – gnasher729 Jul 20 '15 at 09:45
  • 1
    What I want to know is how these get into code in the first place... – Izkata Jul 20 '15 at 17:57
  • 58
    (OT) Because 42 is an answer to _everything?_ – ivan_pozdeev Jul 21 '15 at 22:12
  • 1
    On Opera I see a + sign and a funny less than icon in a square. Still wondering how it can be parsed... – akaltar Jul 22 '15 at 15:57
  • Re: getting them into code - should be expanded to 7-bit printables. as well as displaying properly on monitors, they will print. I am not fond of programmers using hex editors on source. – mckenzm Jul 22 '15 at 16:03
  • 2
    I see no minus sign, just a box with the hex code. – IS4 Jul 23 '15 at 10:11
  • 2
    minus1 for a non-real question: How did you know to tag this as Unicode at the time of asking question? – Thomas Weller Jul 23 '15 at 13:45
  • 4
    @Thomas the fact that the unexpected result was caused by that Unicode character was already clear. – GOTO 0 Jul 23 '15 at 14:55
  • 2
    Kudos for getting the answer to the universe. Make sure you bring your towel to work tomorrow. –  Jul 26 '15 at 17:49
  • 3
    There is also more advanced unicode trolling like russian chars (АВЕМНОРСТХаеорсух) and finally, capital "з" - "З": `var З=100;alert(З+2)`. Glyphs are acceptable identifiers too (javascript, php). – Sanya_Zol Jul 28 '15 at 17:33
  • 1
    I feel blessed seeing the character as a box with numbers. – IS4 May 14 '17 at 22:14

6 Answers

472

That character is "OGHAM SPACE MARK", which is a space character. So the code is equivalent to alert(2+ 40).

I'd also like to know if there are more characters behaving like this.

Any Unicode character in the Zs class is a white space character in JavaScript, but there don't seem to be that many.
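To make the point above concrete, here is a minimal sketch that builds the expression with U+1680 written as an escape sequence, so the character stays visible in any editor. The names `OGHAM_SPACE`, `source`, `result` and `matchesWhitespace` are only illustrative, and `new Function` is used here just to parse and run the generated source text.

```javascript
// U+1680 OGHAM SPACE MARK, written as an escape so it cannot be mistaken for a minus sign.
const OGHAM_SPACE = '\u1680';

// Build the source text "2+<U+1680>40" and let the engine parse and evaluate it.
const source = '2+' + OGHAM_SPACE + '40';
const result = new Function('return ' + source)();
console.log(result); // 42

// The same character is also matched by \s in regular expressions.
const matchesWhitespace = /\s/.test(OGHAM_SPACE);
console.log(matchesWhitespace); // true
```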

However, JavaScript also allows Unicode characters in identifiers, which lets you use interesting variable names like ಠ_ಠ.
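A small sketch of such an identifier in use. The variable `sum` is just an illustrative name, and the file is assumed to be saved and served as UTF-8.

```javascript
// ಠ is U+0CA0 KANNADA LETTER TTHA. It is a letter, so it is allowed in an identifier.
const ಠ_ಠ = 40;

// The identifier behaves like any other variable name.
const sum = ಠ_ಠ + 2;
console.log(sum); // 42
```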

Peter Mortensen
Felix Kling
  • 3
    Box-with-a-hex-code underscore box-with-a-hex-code. Which character is it meant to be? – user253751 Jul 20 '15 at 01:48
  • 12
    @immibis The last part of this answer is an emoticon available in image form at http://www.disapprovallook.com/ – Mark S. Jul 20 '15 at 02:54
  • I'll remember that next time I need to store len(e) in a variable. – Jasen Jul 20 '15 at 11:50
  • 1
    @AnderBiguri Use Python 3. AFAIK there as well you can use most Unicode characters in a variable name. – glglgl Jul 20 '15 at 12:27
  • 1
    You can use Unicode letters in Java identifiers as well. – David Conrad Jul 20 '15 at 17:34
  • Fun little tidbit: When this comes up in a Google search, you just see the space, no symbol. – Broots Waymb Jul 20 '15 at 21:08
  • 3
    Note that not just `Zs` characters are considered to be white space in JavaScript. There are more: https://github.com/mathiasbynens/regexpu/blob/ff17a00b63a017a69fd93da455a8944eb18918ce/scripts/character-class-escape-sets.js#L69-L87 – Mathias Bynens Jul 21 '15 at 08:12
  • 1
    Given this, the zero width no-break space, and the other space-alikes, I suspect a clever coder could hide some quite nasty things in innocuous looking SVN commits to CMS source. – Dewi Morgan Jul 22 '15 at 04:33
  • `ಠ_ಠ`, more like Hoo-dini – Khaled.K Jul 22 '15 at 06:58
  • 22
    My reaction when `ಠ_ಠ` can be used as an identifier in JS: **ಠ_ಠ** – Chris Cirefice Jul 22 '15 at 14:45
  • 4
    @ChrisCirefice underscore being treated as a letter is long-standing in C-style languages. `ಠ` being treated as a letter is just common sense, since it's a letter. It would be a clear bug if `ಠ_ಠ` couldn't be used as an identifier. – Jon Hanna Jul 24 '15 at 08:52
  • @JonHanna You're right about that - I took a compilers course for C, so I know how identifiers work ;) it's just interesting to see crazy unicode symbols instead of the typical `[_a-zA-Z][_a-zA-Z0-9]*` range (at least that's what we implemented in our compiler). The only problem with using unicode like that is that anyone opening your files with ASCII encoding (and many others) is going to have a bad time trying to figure out what the hell you coded :P – Chris Cirefice Jul 24 '15 at 14:42
  • 1
    @ChrisCirefice `ಠ` isn't a "crazy unicode symbol" it's a letter from the Kannada alphabet. If someone uses Kannada in their identifiers it's not going to be very readable to me, but it might be a lot more readable to them than my code is using identifiers in crazy unicode symbols like `width`, `index`, `hashCode` etc. – Jon Hanna Jul 24 '15 at 14:53
  • @JonHanna You know I [had looked up that symbol](http://unicodelookup.com/#ಠ/1), but I didn't really read too much into *kannada letter*. Well, still, unicode poses its problems in many editors :) – Chris Cirefice Jul 24 '15 at 15:04
  • 1
    RE `ಠ_ಠ`: that's *one* way to become the world's least loved JavaScript programmer – Anonymous Penguin Jul 26 '15 at 21:22
82

After reading the other answers, I wrote a simple script to find all Unicode characters in the range U+0000–U+FFFF that behave like white space. It seems there are 26 or 27 of them depending on the browser, with disagreements about U+0085 and U+FFFE.

Note that most of these characters just look like a regular white space.

function isSpace(ch)
{
    try
    {
        return Function('return 2 +' + ch + ' 2')() === 4;
    }
    catch(e)
    {
        return false;
    }
}

for (var i = 0; i <= 0xffff; ++i)
{
    var ch = String.fromCharCode(i);
    if (isSpace(ch))
    {
        document.body.appendChild(document.createElement('DIV')).textContent = 'U+' + ('000' + i.toString(16).toUpperCase()).slice(-4) + '    "' + ch + '"';
    }
}
div { font-family: monospace; }
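For anyone who wants to run the same scan outside a browser, here is a minimal sketch that collects the results in an array instead of writing to the page. It assumes an environment with `console.log`, such as Node.js. The helper name `findSpaceLikeCodePoints` is hypothetical. The list it prints depends on the engine and its Unicode version, so no expected output is shown.

```javascript
// Hypothetical helper: returns the code points in U+0000–U+FFFF that the
// engine accepts between "2 +" and " 2" without changing the result (4).
function findSpaceLikeCodePoints() {
    const found = [];
    for (let i = 0; i <= 0xffff; ++i) {
        const ch = String.fromCharCode(i);
        try {
            if (new Function('return 2 +' + ch + ' 2')() === 4) {
                found.push(i);
            }
        } catch (e) {
            // Parsing or evaluation failed, so this character does not act like white space.
        }
    }
    return found;
}

const spaceLike = findSpaceLikeCodePoints();

// Format each code point as U+XXXX and print the list on one line.
const labels = spaceLike.map(function (cp) {
    return 'U+' + cp.toString(16).toUpperCase().padStart(4, '0');
});
console.log(labels.join(' '));
```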
GOTO 0
  • 17
    U+0085 "NEL" is defined as whitespace by Unicode but has a long history of being mishandled. U+FFFE is a noncharacter with no name and no properties besides NChar and shouldn't be considered whitespace by anything reasonable. That said, my browser disagrees with me on both points :) – hobbs Jul 20 '15 at 06:58
  • 4
    @hobbs U+FFFE is also a `\p{Default Ignorable Code Point}`, not just a `\p{Noncharacter Code Point}`. U+0085 has always been a `\p{Whitespace}` code point. The evil one is U+180E MONGOLIAN VOWEL SEPARATOR, which “recently” lost its `\p{Whitespace}` property. Note that `\p{Pattern Whitespace}` is a much smaller set, and an immutable property. But `\p{Whitespace}` is not. – tchrist Jul 20 '15 at 14:47
  • 2
    `FEFF` is the BOM and can be treated like a "zero width no-break space" within texts. `FFFE` is its endian swapped equivalent. Perhaps that's the reason some browsers treat it as whitespace. – CodesInChaos Jul 20 '15 at 15:57
  • http://www.ecma-international.org/ecma-262/6.0/#sec-white-space (as linked from Felix Kling's answer) specifically calls out U+FEFF to be considered whitespace in JS source code. U+FFFE is not listed, but that strikes me as an error of omission. – zwol Jul 21 '15 at 21:55
  • 2
    @zwol, it's not an error of omission, because there is no character U+FFFE. Treating it as whitespace is a bug. Indeed, treating it as a valid character at all is a bug in most cases. U+0085 is not white space according to the JS spec, but that spec's requiring special-casing of U+0085 to not be a new line is bizarre and arguably a bug in the spec. – Jon Hanna Jul 24 '15 at 08:59
  • @JonHanna You are formally correct on both counts, but I am certain that the requirement _not_ to treat U+0085 as a newline arises from bitter experience of the crap out there on the Web, and I am surprised that there is not a requirement to treat U+FFFE as if it were U+FFEF for the same reason. – zwol Jul 24 '15 at 14:26
  • If you're getting U+FFFE where you should have U+FFEF, then things are likely too messed up for anything to work anyway. If you're getting U+FFFE where you should have U+FEFF and it's not at the very start of the document, things are almost certainly too screwed up, but it's more likely to be the parser's fault than the producer. – Jon Hanna Jul 24 '15 at 14:50
58

The character you are using is visibly longer than the actual minus sign (a hyphen).

 
-

The top character is the one you are using; the bottom one is what the minus sign should be. You seem to know that already, so let's see why JavaScript does this.

The character you used is the Ogham space mark, which is a whitespace character, so it is interpreted just like a space. That means your statement looks like alert(2+ 40) to JavaScript.

There are other characters like this in JavaScript. You can see a full list here on Wikipedia.


Something interesting I noticed about this character is the way that Google Chrome (and possibly other browsers) interprets it in the top bar of the page.


It is a block with 1680 inside of it. That is actually the Unicode code point (in hexadecimal) of the Ogham space mark. It appears to be just my machine doing this, but it is a strange thing.


I decided to try this out in other languages to see what happens and these are the results that I got.


Languages it doesn't work in:

Python 2 & 3

>> 2+ 40
  File "<stdin>", line 1
    2+ 40
        ^
SyntaxError: invalid character in identifier

Ruby

>> 2+ 40
NameError: undefined local variable or method ` 40' for main:Object
    from (irb):1
    from /home/michaelpri/.rbenv/versions/2.2.2/bin/irb:11:in `<main>'

Java (inside the main method)

>> System.out.println(2+ 40);
Main.java:3: error: illegal character: \5760
            System.out.println(2+?40);
                                 ^
Main.java:3: error: ';' expected
            System.out.println(2+?40);
                                  ^
Main.java:3: error: illegal start of expression
            System.out.println(2+?40);
                                    ^
3 errors

PHP

>> 2+ 40;
Use of undefined constant  40 - assumed ' 40' :1

C

>> 2+ 40
main.c:1:1: error: expected identifier or '(' before numeric constant
 2+ 40
 ^
main.c:1:1: error: stray '\341' in program
main.c:1:1: error: stray '\232' in program
main.c:1:1: error: stray '\200' in program

exit status 1

Go

>> 2+ 40
can't load package: package .: 
main.go:1:1: expected 'package', found 'INT' 2
main.go:1:3: illegal character U+1680

exit status 1

Perl 5

>> perl -e'2+ 40'                                                                                                                                   
Unrecognized character \xE1; marked by <-- HERE after 2+<-- HERE near column 3 at -e line 1.

Languages it does work in:

Scheme

>> (+ 2  40)
=> 42

C# (inside the Main() method)

Console.WriteLine(2+ 40);

Output: 42

Perl 6

>> ./perl6 -e'say 2+ 40' 
42
Community
michaelpri
  • 35
    Ubuntu isn't the problem. The window title font you're using is. – Petr Skocik Jul 20 '15 at 00:22
  • 2
    firefox (iceweasel) and google chrome on debian seem to display the unicode char just fine, although I have gone to lengths to ensure unicode compatibility on my system. (actually, the most useful thing I did was the simplest: `sudo apt-get install unicode`, although only after hours of research and failed attempts) – sig_seg_v Jul 20 '15 at 00:23
  • @PSkocik Interesting, I have had font problems on here before, so that is probably likely – michaelpri Jul 20 '15 at 00:24
  • @PSk while I don't have experience with ubuntu directly, debian definitely doesn't come with full unicode compatibility "out of the box", although it is certainly supported by the distribution.... I would be interested to hear your results if a fix like the package I suggested works for you, michaelpri – sig_seg_v Jul 20 '15 at 00:27
  • @sig_seg_v I did just try that command, but it did not work. I'm gonna do a bit more research – michaelpri Jul 20 '15 at 00:28
  • "longer than what a normal minus sign would be on the keyboard" is the strangest description. – Samuel Edwin Ward Jul 20 '15 at 00:44
  • off-topic, but how can you have a title bar in Chrome? – phuclv Jul 20 '15 at 04:18
  • @LưuVĩnhPhúc Must be a part of Ubuntu's version of Google Chrome – michaelpri Jul 20 '15 at 04:19
  • Interesting. The StackOverflow app on my iPhone and all occurrences of that character look like that. It's in Helvetica Neue (UltraLight for title bars). – DDPWNAGE Jul 20 '15 at 05:39
  • 51
    @PSkocik _“Ubuntu isn't the problem. The window title font you're using is.”_ …which is “[Ubuntu](http://font.ubuntu.com/)”. – Sebastian Simon Jul 20 '15 at 12:47
  • @michaelpri In your Java test, it seems you saved the file as UTF-8, but failed to tell the encoding to the compiler. With proper encoding, the character is still not accepted, but with a different error message. Also, similar to the C# example, the expression needs to be wrapped in some boilerplate code to even have a chance at being syntactically correct. – Christian Semrau Jul 21 '15 at 18:36
  • @ChristianSemrau For the C# code, I just put the line that has the actual code. The class and main method were there when I ran it, I just didn't want to bulk up the answer. For Java, I used [this](https://repl.it/languages/Java) and ran the expression in the terminal on the side. That is the error message it gave. I've edited the answer with what would happen if you ran `System.out.println(2+ 40);`. – michaelpri Jul 21 '15 at 18:40
  • I appreciate that you only provide the actual source line without the boilerplate. The error message is still misleading: The REPL uses ASCII encoding, thus disallowing any characters above ASCII code 127. When you replace the character with the equivalent Java escape sequence, you get `System.out.println(2+\u168040);`, which results in 3 error messages. The first is `Main.java:3: error: illegal character: '\u1680'`, the other two are the same as the last two in your answer. – Christian Semrau Jul 21 '15 at 19:47
  • I see [?] in iOS Safari, Chrome and the SO app – mplungjan Jul 21 '15 at 19:59
  • @ChristianSemrau My computer is getting compile errors when I try to do that (I'm using Eclipse). If you would like, you can edit the error post with the one that you get. – michaelpri Jul 21 '15 at 20:27
  • @michaelpri With proper encoding you get 3 compile errors. With IntelliJ and UTF-8, I get the first error `Error:(22, 30) java: illegal character: \5760` (the offending character is represented as octal). With the REPL you used, you got 5 compile errors, because the character is encoded as 3 bytes via UTF-8, and each of these bytes is an invalid ASCII character. I'd prefer you post the 3-error-version. :-) – Christian Semrau Jul 22 '15 at 17:37
  • @ChristianSemrau Would you mind editing it in? The REPL doesn't seem to work great and Eclipse is throwing a compile time error when I try to use the character or `\u1680`. – michaelpri Jul 22 '15 at 17:41
  • I inserted the output from my commandline Java compiler (from JDK 1.7.0_45). – Christian Semrau Jul 22 '15 at 18:09
  • perl5 fails with perl -e'2+ 40' Unrecognized character \xE1; marked by <-- HERE after 2+<-- HERE near column 3 at -e line 1. but perl6 which embraces unicode a bit more has the same problem: $ ./perl6 -e'say 2+ 40' 42 – rurban Jul 22 '15 at 23:43
  • 1
    @PSkocik I finally fixed it :) Just needed to change the system title bar font. – michaelpri Jul 24 '15 at 05:57
  • @LưuVĩnhPhúc Chrome, at least on Linux, has a setting to [use the system title bars](http://i.imgur.com/NgQLlC8.png) rather than its own. – Michael Hampton Jul 27 '15 at 02:39
43

I guess it has something to do with the fact that, for some strange reason, it is classified as whitespace:

$ unicode  
U+1680 OGHAM SPACE MARK
UTF-8: e1 9a 80  UTF-16BE: 1680  Decimal: &#5760;
  ( )
Uppercase: U+1680
Category: Zs (Separator, Space)
Bidi: WS (Whitespace)
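The same category can be checked from JavaScript itself. This is only a sketch and assumes an engine that supports Unicode property escapes in regular expressions (ES2018 or later); the variable names are illustrative.

```javascript
// U+1680 written as an escape so it stays visible.
const OGHAM_SPACE = '\u1680';

// \p{Zs} matches code points whose General_Category is Zs (Separator, Space).
const isSpaceSeparator = /^\p{Zs}$/u.test(OGHAM_SPACE);
console.log(isSpaceSeparator); // true

// For comparison, the real hyphen-minus (U+002D) is not in that category.
const hyphenIsSpaceSeparator = /^\p{Zs}$/u.test('-');
console.log(hyphenIsSpaceSeparator); // false
```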
Petr Skocik
6

I'd also like to know if there are more characters behaving like this.

I seem to remember reading a piece a while back about mischievously replacing semi-colons (U+003B) in someone's code with U+037E which is the Greek question mark.

They both look the same (to the extent that I believe the Greeks themselves use U+003B) but this article stated that the other one wouldn't work.

Some more information on this from Wikipedia is here: https://en.wikipedia.org/wiki/Question_mark#Greek_question_mark

And a (closed) question on using this as a prank from SO itself. Not where I originally read it AFAIR though: JavaScript Prank / Joke
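A quick way to check the claim in JavaScript, as a sketch only: the helper `parses` is a hypothetical name, and the Greek question mark is written as the escape `\u037E` so that no editor or tool can silently swap it for a regular semicolon.

```javascript
// Hypothetical helper: reports whether the engine accepts the text as a function body.
function parses(source) {
    try {
        new Function(source); // only parses the body, never calls it
        return true;
    } catch (e) {
        if (e instanceof SyntaxError) {
            return false;
        }
        throw e;
    }
}

// Regular semicolons (U+003B) terminate the statements as usual.
const withSemicolon = parses('var a = 1; return a;');
console.log(withSemicolon); // true

// The look-alike U+037E GREEK QUESTION MARK is not a valid token.
const withGreekQuestionMark = parses('var a = 1\u037E return a\u037E');
console.log(withGreekQuestionMark); // false
```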

Community
noonand
1

Many languages won't compile this expression, but I was curious what Rust's compiler had to say on the topic. It is notoriously strict but will often give us knowledge and wisdom with loving kindness.

So I asked it to compile this:

fn main() {
    println!("{}", (2+ 40));
}

And the compiler replied:

error: unknown start of token: \u{1680}
  |
  |     println!("{}", (2+ 40));
  |                       ^
  |
help: Unicode character ' ' (Ogham Space mark) looks like ' ' (Space), but it is not

JavaScript, on the other hand, (tested with the latest and most commonly used browser today) seems to be pretty chill about that character and simply ignores it.

at54321