When you write a program in a language, the very first thing your interpreter/compiler must do, in order to go from a sequence of characters to actual action, is to translate that sequence of characters into a higher-level structure. To do so, it first chunks up your program into a sequence of tokens, each expressing what one "word" of the program represents. For example, the construct
if foo == 3: print 'hello'
will be converted into
1,0-1,2: NAME 'if'
1,3-1,6: NAME 'foo'
1,7-1,9: OP '=='
1,10-1,11: NUMBER '3'
1,11-1,12: OP ':'
1,13-1,18: NAME 'print'
1,19-1,26: STRING "'hello'"
2,0-2,0: ENDMARKER ''
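(For reference, a listing like this can be produced with Python's standard tokenize module; here is a minimal sketch, written for Python 3, whose output also includes the NEWLINE token that the listing above leaves out.)

import io
import tokenize

source = "if foo == 3: print 'hello'\n"
# generate_tokens yields one TokenInfo per token, with start/end positions.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print("%d,%d-%d,%d:\t%s\t%r" % (tok.start + tok.end
                                    + (tokenize.tok_name[tok.type], tok.string)))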
But note that even something meaningless like "if if if if" is correctly turned into tokens:
1,0-1,2: NAME 'if'
1,3-1,5: NAME 'if'
1,6-1,8: NAME 'if'
1,9-1,11: NAME 'if'
2,0-2,0: ENDMARKER ''
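A quick way to see this for yourself: the same tokenize call raises no error on the nonsense string, because the tokenizer only matches character patterns and attaches no meaning to them (again a small sketch, assuming Python 3):

import io
import tokenize

# No complaint: to the tokenizer, "if" is just text matching the NAME pattern.
tokens = list(tokenize.generate_tokens(io.StringIO("if if if if\n").readline))
print([(tokenize.tok_name[t.type], t.string) for t in tokens])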
What follows tokenization is parsing: the tokens are organized into a higher-level structure, and the parser checks whether they actually make sense taken together, which the latter example does not but the first one does. To do that, the parser must recognize the actual meaning of the tokens (e.g. that "if" is a keyword and "foo" is a variable), build a tree out of them, organizing them into a hierarchy, and check whether that hierarchy makes sense. This is where the grammar you are looking at comes in. That grammar is written in BNF, a notation for expressing the constructs the language can recognize. The grammar is digested by a program (for example, bison), which has the magic property of taking the grammar and generating actual C code that does the heavy work for you, normally by recognizing the tokens, organizing them, and returning you a parse tree, or telling you where there's a mistake.
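You can watch the two stages diverge with Python's own ast module (a sketch, assuming Python 3, where print needs parentheses in the valid example): the parser turns the well-formed line into a tree of nodes such as Module, If, Compare and Call, and raises a SyntaxError on "if if if if", even though both strings tokenize without complaint.

import ast

# Well-formed: the tokens fit the grammar, so we get back a tree.
print(ast.dump(ast.parse("if foo == 3: print('hello')")))

# Tokenizes fine, but no grammar rule matches four keywords in a row.
try:
    ast.parse("if if if if")
except SyntaxError as err:
    print("rejected by the parser:", err)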
Short version: developing a language is about defining tokens and how these tokens are put together to form something meaningful. This is done through the grammar, which you feed to automated tools to generate the actual "parser" code.
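To make that summary concrete, here is a toy sketch of the whole pipeline for a hypothetical one-rule mini-language (nothing to do with Python's real grammar): the tokens are defined with regular expressions, and the single rule, condition ::= NAME '==' NUMBER, is checked by a hand-written parser; a real project would instead hand such rules to a generator like the bison mentioned above.

import re

# Token definitions for the toy language.
TOKEN_RE = re.compile(r"(?P<NAME>[A-Za-z_]\w*)|(?P<NUMBER>\d+)|(?P<EQ>==)")

def lex(text):
    """Chunk the input into (kind, value) tokens."""
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():          # skip whitespace between tokens
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected character %r at %d" % (text[pos], pos))
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse_condition(tokens):
    """Check the single grammar rule: condition ::= NAME '==' NUMBER."""
    kinds = tuple(kind for kind, _ in tokens)
    if kinds != ("NAME", "EQ", "NUMBER"):
        raise SyntaxError("expected NAME == NUMBER, got %s" % (kinds,))
    return ("condition", tokens[0][1], tokens[2][1])

print(parse_condition(lex("foo == 3")))   # ('condition', 'foo', '3')
print(lex("if if if if"))                 # four NAME tokens; parse_condition would reject them

Real parser generators do exactly this, only for hundreds of rules at once, and they build a nested tree instead of a flat check.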