The below excerpts refer to ECMAScript 2017.

10.1 Source Text, Syntax

Escape sequences, like \u000A, will not be interpreted as line terminators (i.e. new lines):

In string literals, regular expression literals, template literals and identifiers, any Unicode code point may also be expressed using Unicode escape sequences that explicitly express a code point's numeric value. Within a comment, such an escape sequence is effectively ignored as part of the comment.

ECMAScript differs from the Java programming language in the behaviour of Unicode escape sequences.

If the Unicode escape sequence \u000A occurs within a string literal in a Java program, it is interpreted as a line terminator, which is not allowed within a string literal.

A Unicode escape sequence occurring within a string literal in an ECMAScript program always contributes to the literal and is never interpreted as a line terminator or as a code point that might terminate the string literal.

11.8.4 String Literals

Code points may appear as escape sequences in string literals, except reverse solidus (\).

A string literal is zero or more Unicode code points enclosed in single or double quotes. Unicode code points may also be represented by an escape sequence. All code points may appear literally in a string literal except for the closing quote code points, U+005C (REVERSE SOLIDUS), U+000D (CARRIAGE RETURN), U+2028 (LINE SEPARATOR), U+2029 (PARAGRAPH SEPARATOR), and U+000A (LINE FEED). Any code points may appear in the form of an escape sequence.

Questions

  1. How can an escape sequence occur inside a string literal, if \ is not allowed (11.8.4)?
  2. 11.8.4. states that code points may be represented as escape sequences. 10.1 states that escape sequence \u000A inside a string literal is not interpreted as a line terminator. These two seem contradictory. If it is not interpreted as a line break inside the string literal, then how is it interpreted (if at all)?
Magnus

1 Answer

How can an escape sequence occur inside a string literal, if \ is not allowed (11.8.4)?

I think the key part of that section is "appear literally", which is saying that a \ in the string literal does not translate into a backslash in the resulting string itself. It's not saying backslashes are disallowed, it is saying they don't "appear literally".
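As a small illustration of that reading, here is a minimal sketch (the variable names are made up for this example) showing that a backslash in the source starts an escape sequence rather than being copied into the string, and that a backslash can still end up in the value if it is itself written as an escape:

```javascript
// A backslash in the source text introduces an escape sequence;
// it is not copied into the resulting string value as-is.
const newline = "a\nb";         // 3 code units: "a", U+000A, "b"
const backslash = "a\\b";       // 3 code units: "a", U+005C, "b"
const viaUnicode = "a\u005Cb";  // same value, written as a Unicode escape

console.log(newline.length);                        // 3
console.log(backslash.length);                      // 3
console.log(backslash === viaUnicode);              // true
console.log(backslash.charCodeAt(1).toString(16));  // "5c"
```

So U+005C cannot "appear literally" as a character of the string, but it is perfectly legal in the source as the start of an escape sequence.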

10.1 states that escape sequence \u000A inside a string literal is not interpreted as a line terminator.

You skipped the earlier part of that quote: "always contributes to the literal". \u000A is perfectly allowed, and does get added to the content of the string. That note is saying that it isn't treated as a line terminator in the sense of the lexical grammar. It is saying that

var foo = "one\u000Atwo";

is allowed even though

var foo = "one
two";

is a syntax error. Both try to put a newline code point between the words, but the first is allowed because the escape sequence isn't actually treated as a line terminator from the standpoint of the lexer.
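The two cases can be checked at runtime. This is only a sketch: a raw line feed inside quotes can't be written directly in a runnable snippet, so the source text is built as a string and compiled with `new Function`; the names `rawSource` and `escapedSource` are just for this example.

```javascript
// The escaped form is valid source, and its value contains a real LINE FEED.
const escaped = "one\u000Atwo";
console.log(escaped.length);          // 7
console.log(escaped === "one\ntwo");  // true

// Source text with an actual U+000A between the quotes.
const rawSource = 'var foo = "one' + String.fromCharCode(0x0A) + 'two";';
let rawError = null;
try {
  new Function(rawSource); // compiling is enough to trigger the error
} catch (e) {
  rawError = e;
}
console.log(rawError instanceof SyntaxError); // true

// The same statement with the escape sequence compiles and runs.
const escapedSource = 'var foo = "one\\u000Atwo"; return foo;';
const value = new Function(escapedSource)();
console.log(value === escaped); // true
```

Both sources describe a string with a line feed in it; only the one where the line feed is physically present in the source is rejected.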

loganfsmyth
  • Thanks, Logan. On your last point, how is it not treated as a line terminator if the end result is in-fact a String value with a line-break in it (i.e. the escape sequence ended up being interpreted as a line terminator)? – Magnus Apr 03 '18 at 17:01
  • Section 11 is all about the lexical grammar overall. The value is determined by the SV algorithm in https://www.ecma-international.org/ecma-262/8.0/#sec-string-literals-static-semantics-stringvalue, as stated in https://www.ecma-international.org/ecma-262/8.0/#sec-literals-runtime-semantics-evaluation, which is separate from the section's overall comment since the overall comment is about lexing. – loganfsmyth Apr 03 '18 at 17:05
  • It is section 10 though. It states that `a Unicode escape sequence occurring within a string literal in an ECMAScript program always contributes to the literal and is never interpreted as a line terminator or as a code point that might terminate the string literal`. Yet still, it is interpreted that way. Presumably, the lexer turns `\u000A` into a `literal token`, where the value is a line break: `(literal, [line-break])`. – Magnus Apr 03 '18 at 17:33
  • `\u000A` is not interpreted as a "line terminator". Here that has the specific meaning of the case of lexing the text. LineTerminator is a specific token: https://www.ecma-international.org/ecma-262/8.0/#prod-InputElementDiv It is not talking about the generic idea whether or not the final evaluated string's value has a `\n` or whatever in it. – loganfsmyth Apr 03 '18 at 17:38
  • Hmmm, so you are saying that during the tokenization stage: `\u000A` is not evaluated to the nonterminal symbol `LineTerminator`, which in turn ends up as the terminal symbol `<LF>`? If so, what token is it evaluated to by the lexer, and when does it turn into an actual line break? – Magnus Apr 03 '18 at 17:52
  • `\u000A` inside of a string isn't an individual token at all, it's part of the overall string literal token, and then that token may be processed to get the actual string value using SV. – loganfsmyth Apr 03 '18 at 17:53
  • Ok, thanks, I think I got it. So the lexer creates a string literal token, which includes `\u000A` un-interpreted. Then, the parser, when creating code units, turns the escape sequence into an actual line break. Thus, they can say that an escape sequence is not interpreted as a line terminator during the lexical analysis phase. Is that right? – Magnus Apr 03 '18 at 18:07
  • Pretty much, though where exactly the `\u000A` is processed doesn't really matter. It could happen inside the lexer too, it's just that the lexer doesn't count it as a LineTerminator token. The spec defines how things behave, not how they are implemented. – loganfsmyth Apr 03 '18 at 18:18
  • Actually, never mind, it seems by following the Lexical Grammar, a `StringLiteral` is broken down into individual `SingleStringCharacter`s. One of these would be `\u000A`. This, in turn, is a `\ EscapeSequence` which by following the lexical grammar turns into `\n`. Finally, that is converted to an `SV` (aka. a code unit value) via `SV of CharacterEscapeSequence::SingleEscapeCharacter is the code unit whose value is determined by the....`. The last stage is the job of the parser, not the lexer. Thus, they can say that it is not interpreted as `LineTerminator` at lexing phase. Is that right? – Magnus Apr 03 '18 at 18:18
  • Last point: If the note at `10.1` only refers to the lexing stage, it should be more clear, by saying: `...and is never interpreted as a LineTerminator during lexical analysis, or as a code point that might terminate the string literal.`. Would you agree? – Magnus Apr 03 '18 at 18:54
  • Personally I think the surrounding context makes it fairly clear. The spec is also targeted toward people implementing this logic in an engine who are likely already familiar with this type of thing. It'd be tremendously more verbose if the spec tried to target a broader audience. – loganfsmyth Apr 03 '18 at 19:47
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/168148/discussion-between-magnus-and-loganfsmyth). – Magnus Apr 03 '18 at 21:46
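
The distinction drawn in the comments, between the source text of a literal and the string value later computed from it, can be observed from within the language using template literals. This is an analogy rather than the same mechanism: template literals go through the spec's TV/TRV rules, not SV, but `String.raw` exposes the uninterpreted source characters next to the evaluated value.

```javascript
// Raw form: the characters exactly as written in the source.
const raw = String.raw`one\u000Atwo`;
// Cooked form: the value after escape sequences are processed.
const cooked = `one\u000Atwo`;

console.log(raw.length);         // 12 (backslash, "u", and four hex digits kept)
console.log(cooked.length);      // 7  (a single LINE FEED code unit instead)
console.log(raw.includes("\n")); // false
console.log(cooked === "one\ntwo"); // true
```

In the raw form no line feed exists at all; it only appears once the escape sequence is evaluated.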