Neither of the examples in the question are context-free, so strictly speaking they cannot be parsed with context-free grammars. But in practical terms, they are both pretty easy to parse.
The python algorithm is well-described in the Python reference manual (although you need to read that in context.) What's described there is a pre-processing step in which whitespace at the beginning of a line is systematically replaced with INDENT
and DEDENT
tokens.
To clarify: It's not really a preprocessing step, and it's important to observe that it happens after implicit and explicit line joining. (There are previous sections in the reference manual which describe these procedures.) In particular, lines are implicitly joined inside parentheses, braces and brackets, so the process is intertwined with parsing.
In practical terms, both the line-joining and indentation algorithms can be accomplished programmatically; typically, these would be done inside a custom scanner (tokenizer) which maintains both a stack of parentheses and indent levels. The token stream can then be parsed with normal context-free algorithms, but the tokenizer -- although it might use regular expressions -- needs context-sensitive logic (counting spaces, for example). [Note 1]
Similarly, formats which contain explicit sizes (such as most serialization formats, including PNG files, Google protobufs, and HTTP chunked encoding) are not context-free, but are obviously easy to tokenize since the tokenizer simply has to read the length and then read that many bytes.
There are a variety of context-sensitive formalisms, and these definitely have their uses, but in practical parsing the most common strategy is to use a Turing-equivalent formalism (such as any programming language, possibly augmented with a scanner-generator like flex
) for the tokenizer and a context-free formalism for the parser. [Note 2]
Notes:
It may not be immediately obvious that Python indenting is not context-free, since context-free grammars can accept some categories of agreement. For example, {ωω-1 | ω∈Σ*}
(the language of all even-length palindromes) is context-free, as is {anbn}
.
However, these examples can't be extended, because the only count-agreement possible in a context-free language is bracketing. So while palindromes are context-free (you can implement the check with a single stack), the apparently very similar {ωω | ω∈Σ*}
is not, and neither is {anbncn}
One such formalism is back-references in "regular" expressions, which might be available in some PEG implementation. Back-references allow the expression of a variety of context-sensitive languages, but do not allow the expression of all context-free languages. Unfortunately, regular expressions with back-references really suck in practice, because the problem of determining whether a string matches a regex with back-references is NP complete. You might find this question on a sister SE site interesting. (And you might want to reformulate your question in a way that could be asked on that site, http://cs.stackexchange.com.)