
I am trying to understand the translation of a string literal to a final String value (consisting of code unit values), following ECMAScript 2017.


Relevant Excerpts

5.1.2 The Lexical and RegExp Grammars

A lexical grammar for ECMAScript is given in clause 11. This grammar has as its terminal symbols Unicode code points that conform to the rules for SourceCharacter defined in 10.1. It defines a set of productions, starting from the goal symbol InputElementDiv, InputElementTemplateTail, InputElementRegExp, or InputElementRegExpOrTemplateTail, that describe how sequences of such code points are translated into a sequence of input elements.

Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language.

5.1.4 The Syntactic Grammar

When a stream of code points is to be parsed as an ECMAScript Script or Module, it is first converted to a stream of input elements by repeated application of the lexical grammar; this stream of input elements is then parsed by a single application of the syntactic grammar.

and

11 ECMAScript Language: Lexical Grammar

The source text of an ECMAScript Script or Module is first converted into a sequence of input elements, which are tokens, line terminators, comments, or white space. The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.

11.8.4 String Literals

StringLiteral ::
    " DoubleStringCharacters_opt "
    ' SingleStringCharacters_opt '

SingleStringCharacters ::
    SingleStringCharacter SingleStringCharacters_opt

SingleStringCharacter ::
    SourceCharacter but not one of ' or \ or LineTerminator
    \ EscapeSequence
    LineContinuation

EscapeSequence ::
    CharacterEscapeSequence
    0 [lookahead ∉ DecimalDigit]
    HexEscapeSequence
    UnicodeEscapeSequence

CharacterEscapeSequence ::
    SingleEscapeCharacter
    NonEscapeCharacter

NonEscapeCharacter ::
    SourceCharacter but not one of EscapeCharacter or LineTerminator

EscapeCharacter ::
    SingleEscapeCharacter
    DecimalDigit
    x
    u

11.8.4.3 Static Semantics: SV

A string literal stands for a value of the String type. The String value (SV) of the literal is described in terms of code unit values contributed by the various parts of the string literal.

and

The SV of SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.

The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the EscapeSequence.
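These SV rules bottom out in the UTF16Encoding(cp) operation from 10.1.1, which turns one code point into one or two code units. A minimal JavaScript sketch of that operation (the function name is mine; the arithmetic is the spec's):

```javascript
// UTF16Encoding(cp), per ES2017 10.1.1: a code point in the Basic
// Multilingual Plane becomes a single code unit; anything above U+FFFF
// becomes a lead/trail surrogate pair.
function utf16Encoding(cp) {
  if (cp <= 0xFFFF) {
    return [cp];                                             // one code unit
  }
  const lead  = Math.floor((cp - 0x10000) / 0x400) + 0xD800; // lead surrogate
  const trail = ((cp - 0x10000) % 0x400) + 0xDC00;           // trail surrogate
  return [lead, trail];
}

utf16Encoding(0x0062);  // → [0x0062]           ("b")
utf16Encoding(0x1D306); // → [0xD834, 0xDF06]   ("𝌆", a non-BMP code point)
```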


Question

Assume we have string literal 'b\ar'. I now want to follow the above lexical grammar and semantic grammar, to turn the string literal into a set of code unit values.

  1. b\ar is recognized as a CommonToken
  2. b\ar is further recognized as a StringLiteral
  3. StringLiteral is translated to SingleStringCharacters
  4. Each code point in SingleStringCharacters is translated to SingleStringCharacter
  5. Each SingleStringCharacter without a \ in front is translated to a SourceCharacter
  6. \a is recognized as \ EscapeSequence
  7. EscapeSequence (a) is translated to NonEscapeCharacter
  8. NonEscapeCharacter is translated to SourceCharacter
  9. All SourceCharacters are translated to any Unicode code point
  10. Finally, the SV rules are applied to get string values and thus code unit values

The problem I have is that the StringLiteral input element is now:

SourceCharacter, \ SourceCharacter, SourceCharacter

There is no SV rule for \ SourceCharacter, only for \ EscapeSequence.

Which makes me wonder if I have the order wrong, or have misunderstood how the lexical and syntactic grammars are applied.

I am also confused about how the SV rules are applied altogether. Because they are defined to apply to nonterminal symbols, as opposed to terminal symbols (which should be the result after the lexical grammar has been applied).

Any help is deeply appreciated.

Magnus
  • *“NonEscapeCharacter is translated to SourceCharacter”* Translated? Where did you get this? – Ry- Apr 04 '18 at 02:56
  • Hi @Ryan. I added an excerpt at the top, clarifying where the term translate came from. As far as I understand, the lexical grammar is applied (often recursively) to nonterminal symbols, until only terminal symbols (aka. tokens / input elements) remain. – Magnus Apr 04 '18 at 03:03
  • Under 11.8.4: `NonEscapeCharacter :: SourceCharacter but not one of EscapeCharacter or LineTerminator` – Magnus Apr 04 '18 at 03:05
  • I believe this: `SourceCharacter, \ SourceCharacter, SourceCharacter` to be in error. It should be this: `SourceCharacter, SingleEscapeCharacter, SourceCharacter, SourceCharacter`. – Randy Casburn Apr 04 '18 at 03:05
  • That doesn’t mean it’s not a NonEscapeCharacter (or, in turn, not a CharacterEscapeSequence and not an EscapeSequence), though. The ``\`` EscapeCharacter rule applies and directs you to the SV “The SV of CharacterEscapeSequence :: NonEscapeCharacter is the SV of the NonEscapeCharacter”. – Ry- Apr 04 '18 at 03:06
  • There should be a step-by-step algorithm of how it goes from nonterminal symbols to code unit values. It's not like we can just choose where to stop the lexing process, on a case by case basis (e.g. at `\ EscapeCharacter`). As far as I understand, lexing happens first, using lexical grammar, to turn it all into terminal symbols. Then the semantic analysis is applied. – Magnus Apr 04 '18 at 03:10
  • @RandyCasburn It wouldn't be a `SingleEscapeCharacter`, because it is `\a` and `a` is a `NonEscapeCharacter`. – Magnus Apr 04 '18 at 03:12
  • `CharacterEscapeSequence :: SingleEscapeCharacter NonEscapeCharacter` ^--- that is what the grammar says. `\a` is a CharacterEscapeSequence that the lexer tokenizes as a SingleEscapeCharacter (`\`) followed by a NonEscapeCharacter - the NonEscapeCharacter is then tokenized as a SourceCharacter. – Randy Casburn Apr 04 '18 at 03:17
  • @RandyCasburn it is not both `SingleEscapeCharacter` and `NonEscapeCharacter`, it is either or. See 5.1.5 Grammar Notation: https://www.ecma-international.org/ecma-262/8.0/index.html#sec-grammar-notation. That is not relevant to the question however. We can continue that piece in chat, if you want. – Magnus Apr 04 '18 at 03:23

1 Answer


Alright, assuming we're going in with a single token 'b\ar', which is, as you've said, a StringLiteral token. Applying the algorithm defined in 11.8.4.3 Static Semantics: SV as well as 10.1.1 Static Semantics: UTF16Encoding(cp), we follow the SV rules:

  1. The SV of StringLiteral:: ' SingleStringCharacters ' is the SV of SingleStringCharacters.
    • Unwrap the quotes since we're recursively running SV on just the SingleStringCharacters part, e.g. SV(b\ar)
  2. The SV of SingleStringCharacters:: SingleStringCharacter SingleStringCharacters is a sequence of one or two code units that is the SV of SingleStringCharacter followed by all the code units in the SV of SingleStringCharacters in order.

    This says "call SV on every SingleStringCharacter, appending the results".

    1. SV(b)
      1. The SV of SingleStringCharacter:: SourceCharacter but not one of ' or \ or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
        • The code point "b" is code unit \x0062, so the result here is essentially a code unit sequence of a single 16-bit unit \x0062
    2. SV(\a)
      1. The SV of SingleStringCharacter:: \ EscapeSequence is the SV of the EscapeSequence.
        • Essentially SV(EscapeSequence), i.e. SV(a) (no \ prefix)
      2. The SV of EscapeSequence:: CharacterEscapeSequence is the SV of the CharacterEscapeSequence.
        • Basically just passing through SV(a)
      3. The SV of CharacterEscapeSequence:: NonEscapeCharacter is the SV of the NonEscapeCharacter.
        • More pass-through
      4. The SV of NonEscapeCharacter:: SourceCharacter but not one of EscapeCharacter or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
        • The code point "a" is code unit \x0061, so this results in a single-unit sequence of just \x0061.
    3. SV(r)
      • Following the same steps as for SV(b), this results in a single-unit sequence containing \x0072.
  3. Merging the sequence SV(b) + SV(\a) + SV(r) back together, the value of the string is the sequence of UTF16 code units [\x0062, \x0061, \x0072]. That sequence of code units results in bar.
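The walkthrough above can be sketched as a toy evaluator that runs over the already-lexed StringLiteral token. This handles only the productions 'b\ar' exercises (BMP characters, NonEscapeCharacter escapes); the names are mine, but each step mirrors one of the SV clauses:

```javascript
// Toy SV: StringLiteral token in, sequence of UTF-16 code units out.
function sv(stringLiteralToken) {
  // SV(StringLiteral :: ' SingleStringCharacters ') = SV(SingleStringCharacters)
  const chars = stringLiteralToken.slice(1, -1);
  const codeUnits = [];
  for (let i = 0; i < chars.length; i++) {
    if (chars[i] === "\\") {
      // SV(SingleStringCharacter :: \ EscapeSequence) = SV(EscapeSequence);
      // EscapeSequence -> CharacterEscapeSequence -> NonEscapeCharacter are
      // pure pass-throughs for a NonEscapeCharacter like "a".
      i++;
    }
    // SV(... SourceCharacter ...) = UTF16Encoding of its code point
    // (charCodeAt is enough here because we only handle BMP characters)
    codeUnits.push(chars.charCodeAt(i));
  }
  return codeUnits;
}

sv("'b\\ar'"); // → [0x62, 0x61, 0x72], i.e. "bar"
```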

Edit:

I thought we should first apply the lexical grammar and end up with tokens, and then subsequently apply the SV rules?

The "token", from a lexer's standpoint, is StringLiteral; everything within that is just information on how to parse it. EscapeSequence is not a type of token.

SV defines how to break down the StringLiteral token into a sequence of code units.

As stated in 11 ECMAScript Language: Lexical Grammar

The source text of an ECMAScript Script or Module is first converted into a sequence of input elements, which are tokens, line terminators, comments, or white space. The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.

These "input elements" are the tokens used by the parser grammar.
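You can observe the end result directly in any engine: the whole quoted literal is one token, and its String value for 'b\ar' is the three code units of bar:

```javascript
// \a is not a SingleEscapeCharacter, so "a" matches NonEscapeCharacter and
// its SV is just "a" itself - the backslash contributes no code units.
const s = 'b\ar';
console.log(s === 'bar');                                    // true
console.log([...s].map(c => c.charCodeAt(0).toString(16)));  // [ '62', '61', '72' ]
```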

Assuming the order of events is right, my second question is around SV(\a). The first escape sequence rule is applied and we are left with SV(a), which should follow the same path as SV(b), no?

There's more than just the value, there is also the datatype. Using Flow/Typescript-style annotations, you can kind of think of the steps above for

  1. The SV of SingleStringCharacter:: \ EscapeSequence is the SV of the EscapeSequence.
  2. The SV of EscapeSequence:: CharacterEscapeSequence is the SV of the CharacterEscapeSequence.
  3. The SV of CharacterEscapeSequence:: NonEscapeCharacter is the SV of the NonEscapeCharacter.
  4. The SV of NonEscapeCharacter:: SourceCharacter but not one of EscapeCharacter or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.

as if it were an overloaded function, e.g.

function SV(parts: ["\\", EscapeSequence]) {
    return SV(parts[1]);
}
function SV(parts: [CharacterEscapeSequence]) {
    return SV(parts[0]);
}
function SV(parts: [NonEscapeCharacter]) {
    return SV(parts[0]);
}
function SV(parts: [SourceCharacter]) {
    return UTF16Encoding(parts[0]);
}

So SV(a) would be kind of like SV("a": [CharacterEscapeSequence]) whereas SV(b) has a different type.
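The overload notation above isn't something TypeScript will actually run (you can't define SV four times), but the same dispatch-on-production idea can be made runnable by tagging each parse node with the grammar production it came from. A sketch, with node shapes of my own invention:

```javascript
// SV dispatches on the production tag, mirroring how the spec defines one
// SV clause per grammar production rather than per character value.
function SV(node) {
  switch (node.production) {
    case "SingleStringCharacter :: \\ EscapeSequence":
      return SV(node.escapeSequence);
    case "EscapeSequence :: CharacterEscapeSequence":
      return SV(node.characterEscapeSequence);
    case "CharacterEscapeSequence :: NonEscapeCharacter":
      return SV(node.nonEscapeCharacter);
    case "NonEscapeCharacter :: SourceCharacter":
      return [node.sourceCharacter.codePointAt(0)]; // UTF16Encoding, BMP-only here
    default:
      throw new Error("no SV clause for " + node.production);
  }
}

// The '\a' from 'b\ar': the same SourceCharacter "a" as in a bare "a", but
// wrapped in the escape-sequence productions, so SV takes a longer path.
const escapedA = {
  production: "SingleStringCharacter :: \\ EscapeSequence",
  escapeSequence: {
    production: "EscapeSequence :: CharacterEscapeSequence",
    characterEscapeSequence: {
      production: "CharacterEscapeSequence :: NonEscapeCharacter",
      nonEscapeCharacter: {
        production: "NonEscapeCharacter :: SourceCharacter",
        sourceCharacter: "a",
      },
    },
  },
};

SV(escapedA); // → [0x61]
```

What differs between SV(\a) and SV(b) is the chain of productions walked, not the final code units.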

loganfsmyth
  • Thanks a lot, Logan. My first question is around the order of things. You apply the SV algorithm right away. I thought we should first apply the lexical grammar and end up with tokens, and then subsequently apply the SV rules? Of course, I might have understood that completely wrong. To contradict myself, I feel like I have read somewhere that every application of a Lexical Grammar production results in a run of SV(), thus outputting a String value. – Magnus Apr 04 '18 at 09:39
  • Assuming the order of events is right, my second question is around SV(\a). The first escape sequence rule is applied and we are left with SV(a), which should follow the same path as SV(b), no? Are the rest of the steps necessary for that SV(a)? I.e. why do we go through `CharacterEscapeSequence` etc., when presumably the algorithm then only sees SV(a). – Magnus Apr 04 '18 at 09:43
  • That clarifies it for me Logan, thanks as always. Is there a statement in the spec stating this (or something similar): `There's more than just the value, there is also the datatype.` ? If not, where did you get that from? – Magnus Apr 04 '18 at 22:15
  • SV wouldn't be defined the way it is if that wasn't how it worked. It specifically lists out the behavior of the algorithm for each type of value. If the token was broken down first before SV ran, it wouldn't make sense to define it that way. – loganfsmyth Apr 04 '18 at 22:18
  • As a side point, I found this on Wikipedia: https://en.wikipedia.org/wiki/Lexical_analysis#Evaluator. It says that a lexer first goes through each character to see which character sequences make up valid tokens (scanner phase), then the evaluator phase goes over each lexeme to produce a value, and thus create a finished token. Key point: If the evaluator sees an escape sequence in a string literal, it will run a lexer on it (so the lexer runs a lexer) to unescape the escape sequence. When unescaped, it removes the quotes and passes the literal token to the parser. – Magnus Apr 04 '18 at 22:34
  • The description from Wikipedia (my previous comment) does not really tie in with how ECMAScript does it, unfortunately. ECMAScript says: Create the StringLiteral token (which may include escape sequences), then apply the SV algorithm on it to output a stream of code units. The Wikipedia description says: Escape sequences are removed before a token is created. The finished token is then passed to the parser, which presumably will evaluate the validity of the code and compile it all to code units. – Magnus Apr 04 '18 at 22:45
  • The spec does not usually specify _how_ to implement something, just how things should behave. The specifics of how a lexer might work, or if the lexer is even a separate thing, are entirely up to implementers. – loganfsmyth Apr 04 '18 at 22:54