I currently understand how Zend parses operators from reading the Zend/zend_language_parser.y file in php-src, but I'm confused about how variables are recognized.

The Bison token is:

%token <ast> T_VARIABLE  "variable (T_VARIABLE)"

How does it match the dollar prefix?

Gavin Kwok
  • php-src v7.1.16 – Gavin Kwok Aug 21 '18 at 10:57
  • I assume using a very naive regular expression, which leads to `$™` being a valid variable name sheesh – Dale Aug 21 '18 at 10:58
  • The `Makefile` uses `zend_language_scanner.l` to generate `zend_language_scanner.c`, and `zend_language_parser.y` to generate `zend_language_parser.c`. `zend_language_scanner.l` includes both `zend_language_parser.h` and `zend_language_scanner.h`. It also defines which patterns to match, such as `$`, and which token to return for each match, such as `T_VARIABLE`. Those tokens are declared in `zend_language_parser.y`. – Gavin Kwok Aug 22 '18 at 02:23

1 Answer

The token declaration tells us that there's a token type named T_VARIABLE that is associated with values of type ast and should be referred to as "variable (T_VARIABLE)" in error messages. It tells us nothing about which characters a T_VARIABLE token may consist of - nothing in the Bison file will tell us that.

That's because a Bison parser does not interact with characters - it interacts with tokens produced by the lexer/scanner. The parser simply consumes the tokens generated by the scanner. It does not need to know which character sequences are translated to which tokens - that's the scanner's job.
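
To make that division of labour concrete, here is a minimal Python sketch. It is not how php-src does it (the real scanner is generated by re2c and the parser by Bison); the rule table and the `scan` and `is_assignment` helpers are hypothetical names invented for this illustration. The point is only that the "grammar" side looks at token types and never at characters.

```python
import re

# Hypothetical, heavily simplified token rules (not the php-src definitions).
TOKEN_RULES = [
    ("T_VARIABLE", re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")),
    ("T_LNUMBER", re.compile(r"[0-9]+")),
    ("=", re.compile(r"=")),
    (";", re.compile(r";")),
    (None, re.compile(r"\s+")),  # whitespace is consumed but yields no token
]


def scan(source):
    """Scanner: turn characters into (token_type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        for token_type, pattern in TOKEN_RULES:
            match = pattern.match(source, pos)
            if match:
                if token_type is not None:
                    tokens.append((token_type, match.group()))
                pos = match.end()
                break
        else:
            # No rule matched at this position.
            raise SyntaxError(f"unexpected character {source[pos]!r}")
    return tokens


def is_assignment(tokens):
    """'Parser' rule: inspects token types only, never the characters."""
    return [token_type for token_type, _ in tokens] == [
        "T_VARIABLE", "=", "T_LNUMBER", ";"
    ]
```

Calling `is_assignment(scan("$a = 1;"))` goes through both stages: only `scan` ever sees the dollar sign, while `is_assignment` sees just the token type `T_VARIABLE`.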

So if you want to see the dollar sign, you need to look into the scanner (zend_language_scanner.l) where you'll find (among others) this:

<ST_IN_SCRIPTING,ST_DOUBLE_QUOTES,ST_HEREDOC,ST_BACKQUOTE,ST_VAR_OFFSET>"$"{LABEL} {
    RETURN_TOKEN_WITH_STR(T_VARIABLE, 1);
}

This tells us that inside regular PHP sections, double quotes, heredocs, back quotes and brackets within an interpolated string (i.e. basically anywhere except outside of the <?php tags), a dollar followed by a label produces a T_VARIABLE token. A label is defined as an arbitrary non-empty sequence of letters, digits and underscores that doesn't start with a digit. The definition also accepts any byte with a value of 0x80 or above, which is why multibyte names such as `$™` are valid.
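
The rule can be approximated with an ordinary regular expression. The sketch below is Python, not scanner code, and it assumes `LABEL` is defined as `[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*`; the exact lower bound of the byte range may differ between PHP versions, so check the definition in your copy of zend_language_scanner.l. The `match_variable` helper is a hypothetical name used only here.

```python
import re

# Assumed equivalent of "$"{LABEL}; the pattern works on bytes because the
# scanner matches raw bytes, not decoded characters.
VARIABLE = re.compile(rb"\$[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*")


def match_variable(source: bytes):
    """Return the variable token at the start of source, or None."""
    match = VARIABLE.match(source)
    return match.group() if match else None
```

Under that assumption, `match_variable(b"$foo = 1")` returns `b"$foo"`, `match_variable(b"$9x")` returns `None` because a label cannot start with a digit, and `match_variable("$™".encode("utf-8"))` matches the whole name because every byte of the UTF-8 encoding of `™` is 0x80 or above.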

sepp2k