I am trying to understand the translation of a string literal to a final String value (consisting of code unit values), following ECMAScript 2017.
Relevant Excerpts
5.1.2 The Lexical and RegExp Grammars
A lexical grammar for ECMAScript is given in clause 11. This grammar has as its terminal symbols Unicode code points that conform to the rules for SourceCharacter defined in 10.1. It defines a set of productions, starting from the goal symbol InputElementDiv, InputElementTemplateTail, or InputElementRegExp, or InputElementRegExpOrTemplateTail, that describe how sequences of such code points are translated into a sequence of input elements.
Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language.
5.1.4 The Syntactic Grammar
When a stream of code points is to be parsed as an ECMAScript Script or Module, it is first converted to a stream of input elements by repeated application of the lexical grammar; this stream of input elements is then parsed by a single application of the syntactic grammar.
and
11 ECMAScript Language: Lexical Grammar
The source text of an ECMAScript Script or Module is first converted into a sequence of input elements, which are tokens, line terminators, comments, or white space. The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.
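To make sure I am reading this right, here is a toy scanner I wrote. The `tokenRe` pattern is my own simplification, not the spec's lexical grammar; it only illustrates the "longest possible sequence" idea for this one snippet:

```js
// Toy illustration of "repeatedly taking the longest possible sequence of
// code points as the next input element". tokenRe is my own simplification,
// not the spec's lexical grammar.
const source = "var s = 'b\\ar';";
const tokenRe = /'(?:[^'\\\r\n]|\\[^])*'|[A-Za-z_$][\w$]*|=|;|\s+/y;

let pos = 0;
const inputElements = [];
while (pos < source.length) {
  tokenRe.lastIndex = pos;
  const match = tokenRe.exec(source);
  if (match === null) throw new SyntaxError(`Unexpected character at index ${pos}`);
  inputElements.push(match[0]);
  pos = tokenRe.lastIndex;
}

console.log(inputElements);
// [ 'var', ' ', 's', ' ', '=', ' ', "'b\\ar'", ';' ]
// The whole literal 'b\ar' -- quotes and backslash included -- comes out as a
// single input element (a StringLiteral token); no SV evaluation has happened yet.
```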
11.8.4 String Literals
StringLiteral ::
" DoubleStringCharacters_opt "
' SingleStringCharacters_opt '
SingleStringCharacters ::
SingleStringCharacter SingleStringCharacters_opt
SingleStringCharacter ::
SourceCharacter but not one of ' or \ or LineTerminator
\ EscapeSequence
LineContinuation
EscapeSequence ::
CharacterEscapeSequence
0 [lookahead ∉ DecimalDigit]
HexEscapeSequence
UnicodeEscapeSequence
CharacterEscapeSequence ::
SingleEscapeCharacter
NonEscapeCharacter
NonEscapeCharacter ::
SourceCharacter but not one of EscapeCharacter or LineTerminator
EscapeCharacter ::
SingleEscapeCharacter
DecimalDigit
x
u
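To check my reading of these productions, I sketched which alternative each part of the literal's body would match. This is my own simplification and only covers the cases `'b\ar'` needs; Hex/Unicode/0 escapes and LineContinuation are left out:

```js
// A sketch of my reading of the productions above: which alternative matches
// each part of the body of 'b\ar'? Only the cases this literal needs are covered.
const singleEscapeCharacters = new Set(["'", '"', "\\", "b", "f", "n", "r", "t", "v"]);

function classifyBody(body) {
  const parts = [];
  for (let i = 0; i < body.length; i++) {
    if (body[i] === "\\") {
      const next = body[++i];
      const escapeKind = singleEscapeCharacters.has(next)
        ? "SingleEscapeCharacter"   // e.g. \n, \t, \\
        : "NonEscapeCharacter";     // 'a' is neither an EscapeCharacter nor a LineTerminator
      parts.push(`\\${next}  =>  \\ EscapeSequence -> CharacterEscapeSequence -> ${escapeKind}`);
    } else {
      parts.push(`${body[i]}   =>  SourceCharacter but not one of ' or \\ or LineTerminator`);
    }
  }
  return parts;
}

classifyBody("b\\ar").forEach(p => console.log(p));
// b   =>  SourceCharacter but not one of ' or \ or LineTerminator
// \a  =>  \ EscapeSequence -> CharacterEscapeSequence -> NonEscapeCharacter
// r   =>  SourceCharacter but not one of ' or \ or LineTerminator
```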
11.8.4.3 Static Semantics: SV
A string literal stands for a value of the String type. The String value (SV) of the literal is described in terms of code unit values contributed by the various parts of the string literal.
and
The SV of SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the EscapeSequence.
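And this is how I think those two SV rules would be evaluated for the body of `'b\ar'`. Again a sketch of my own: only the SourceCharacter and `\` NonEscapeCharacter cases are handled, because those are the only ones this literal exercises:

```js
// A sketch of the two SV rules quoted above, applied to the body of 'b\ar'.
function sv(body) {
  const codeUnits = [];
  let i = 0;
  while (i < body.length) {
    let cp;
    if (body[i] === "\\") {
      // SV of SingleStringCharacter :: \ EscapeSequence is the SV of the
      // EscapeSequence; for a NonEscapeCharacter that ends up being the code
      // point of the escaped character itself (the backslash contributes nothing).
      cp = body.codePointAt(i + 1);
      i += 1 + (cp > 0xffff ? 2 : 1);
    } else {
      // SV of SingleStringCharacter :: SourceCharacter ... is the
      // UTF16Encoding of the code point value of the SourceCharacter.
      cp = body.codePointAt(i);
      i += cp > 0xffff ? 2 : 1;
    }
    // UTF16Encoding: code points above U+FFFF become a surrogate pair.
    if (cp > 0xffff) {
      codeUnits.push(0xd800 + ((cp - 0x10000) >> 10), 0xdc00 + ((cp - 0x10000) & 0x3ff));
    } else {
      codeUnits.push(cp);
    }
  }
  return codeUnits;
}

console.log(sv("b\\ar"));                           // [ 98, 97, 114 ]
console.log(sv("b\\ar").map(u => u.toString(16)));  // [ '62', '61', '72' ]
```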
Question
Assume we have the string literal 'b\ar'. I now want to follow the above lexical grammar and static semantics (SV) rules to turn the string literal into a sequence of code unit values. My understanding of the steps:
- `b\ar` is recognized as a CommonToken
- `b\ar` is further recognized as a StringLiteral
- StringLiteral is translated to SingleStringCharacters
- Each code point in SingleStringCharacters is translated to a SingleStringCharacter
- Each SingleStringCharacter without a `\` in front of it is translated to a SourceCharacter
- `\a` is recognized as `\ EscapeSequence`
- EscapeSequence (`a`) is translated to NonEscapeCharacter
- NonEscapeCharacter is translated to SourceCharacter
- All SourceCharacters are translated to "any Unicode code point"
- Finally, the SV rules are applied to get String values and thus code unit values (a quick runtime check of the result I expect follows below)
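For reference, this is the end result I expect (just checked in a JavaScript console, not derived from the spec):

```js
const s = 'b\ar';      // the literal from the question
console.log(s);        // bar  (the \a escape contributes just 'a')
console.log([...s].map(c => c.charCodeAt(0)));   // [ 98, 97, 114 ]
```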
The problem I have is that the StringLiteral input element is now:
SourceCharacter, \ SourceCharacter, SourceCharacter
There is no SV rule for \ SourceCharacter, only for \ EscapeSequence.
This makes me wonder whether I have the order wrong, or have misunderstood how the lexical and syntactic grammars are applied.
I am also confused about how the SV rules are applied at all, because they are defined on nonterminal symbols, as opposed to terminal symbols (which should be what is left after the lexical grammar has been applied).
Any help is deeply appreciated.