Are parsing expression grammars suited to parsing the shell command language?

Question

The POSIX shell command language is not easy to parse, largely because of tight coupling between lexing and parsing.

Parsing expression grammars (PEGs), however, are often scannerless. By combining lexing and parsing, it seems that I could avoid these problems. The language that I am using (Rust) has a well-maintained PEG library, but I know of three difficulties that could make it impractical to use this library:

1) Shells must be able to parse line by line, without reading characters past the end of the line.
2) Aliases are purely lexical, and in certain situations can cause a token to be replaced by any sequence of other tokens.
3) Shell reserved words are only recognized in certain situations.

Is a PEG suited to parsing the shell command language given these requirements, or is a hand-written recursive-descent parser more suitable?

FWIW, bash uses a fairly straightforward bison-generated parser, combined with an extremely complicated handwritten lexer. I have no idea how well PEG would work, but if you give it a try, let us know. — rici, Mar 09 '15 at 03:37
Three reasons: it is GPL while my shell is under MIT/Apache 2, it is in C while my shell is in Rust, and I would learn nothing from it. — Demi, Mar 09 '15 at 04:42
Yes. PEG parsers do scanning. The grammar language is more powerful than regular expressions, and as compact and convenient. I've translated several ANTLR grammars to Grako (PEG), and the lexical part has translated easily. PEG will be less efficient than a state-machine based lexer, though. — Apalala, Mar 09 '15 at 16:40
@Apalala I do not just mean for lexing; I mean for parsing too. — Demi, Mar 09 '15 at 16:59
This question is probably a better fit for Programmers Stack Exchange than Stack Overflow. As an SO question, it seems too broad and too much of an opinion poll. YMMV. — Todd A. Jacobs, Sep 28 '15 at 05:26

cliffordheath · Accepted Answer · 2015-11-04T05:28:31.453

Yes, a PEG can be used, and none of the issues you note should be a problem. In particular:

1) Parsing line by line: most PEG tools do not have any built-in white-space skipping. All white space, including newlines, must be handled explicitly by you, which means you can handle newlines any way you like.

2) You should not use the parse tree from PEG as your AST. Instead you should descend the parse tree and build an AST. For aliases then, after the parse has completed and you're building your AST, you can detect the alias and insert the appropriate expansion for the alias instead.

3) Reserved words are not reserved unless you reserve them. That is, in a context where either a reserved word or another alphanumeric symbol can occur, you must check for the reserved words explicitly first and only then for the arbitrary alphanumeric symbol, because once a PEG ordered choice has matched an alternative, it will not backtrack to try the later ones. Anywhere a reserved word is not permitted, simply don't check for it, and your generalised alphanumeric symbol rule will succeed instead.
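Points 1 and 3 can be sketched with a few hand-written PEG-style functions in Rust. This is only an illustration under stated assumptions, not the API of any particular PEG library: the names `Word`, `RESERVED`, `command_word`, `argument`, `blanks` and `line` are hypothetical, the reserved-word list is a small subset, and quoting, operators and aliases are ignored entirely. The comment above each function gives the PEG rule it stands for.

```rust
/// One word of a simple command. Illustrative only: a real shell grammar
/// has many more token kinds (quotes, operators, expansions, ...).
#[derive(Debug, PartialEq)]
enum Word<'a> {
    Reserved(&'a str),
    Name(&'a str),
}

// Assumed subset of reserved words, for demonstration only.
const RESERVED: [&str; 5] = ["if", "then", "else", "fi", "while"];

/// name <- [A-Za-z0-9_]+
/// Returns the matched text and the remaining input.
fn name(input: &str) -> Option<(&str, &str)> {
    let end = input
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some((&input[..end], &input[end..]))
    }
}

/// reserved <- ("if" / "then" / ...) ![A-Za-z0-9_]
/// Taking the longest name first makes "iffy" fail here, as it should.
fn reserved(input: &str) -> Option<(&str, &str)> {
    let (word, rest) = name(input)?;
    if RESERVED.iter().any(|k| *k == word) {
        Some((word, rest))
    } else {
        None
    }
}

/// command_word <- reserved / name
/// Ordered choice: the reserved-word alternative is tried first.
fn command_word(input: &str) -> Option<(Word<'_>, &str)> {
    if let Some((word, rest)) = reserved(input) {
        return Some((Word::Reserved(word), rest));
    }
    name(input).map(|(word, rest)| (Word::Name(word), rest))
}

/// argument <- name
/// Reserved words are simply not checked for in this position.
fn argument(input: &str) -> Option<(Word<'_>, &str)> {
    name(input).map(|(word, rest)| (Word::Name(word), rest))
}

/// blanks <- [ \t]*
/// Deliberately does not consume '\n', so the line rule controls newlines.
fn blanks(input: &str) -> &str {
    input.trim_start_matches(|c: char| c == ' ' || c == '\t')
}

/// line <- blanks command_word (blanks argument)* blanks "\n"
/// Consumes exactly one line and returns the untouched remainder.
fn line(input: &str) -> Option<(Vec<Word<'_>>, &str)> {
    let (first, mut rest) = command_word(blanks(input))?;
    let mut words = vec![first];
    loop {
        let after = blanks(rest);
        match argument(after) {
            Some((word, r)) => {
                words.push(word);
                rest = r;
            }
            None => {
                rest = after;
                break;
            }
        }
    }
    let rest = rest.strip_prefix('\n')?;
    Some((words, rest))
}
```

With this sketch, `line("if iffy if\necho\n")` treats only the first `if` as reserved, parses `iffy` and the trailing `if` as ordinary names, and returns `"echo\n"` as the unread remainder, so nothing past the first newline is consumed.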

edited Nov 04 '15 at 05:28

answered Nov 03 '15 at 05:07

cliffordheath


I'm a little off my turf, but I read the word "alias" as "parameterless macro". Who says a macro expansion has to form a phrase in the grammar you provide? If it does not, you can't just do a "tree replacement". (Frankly, these are easily handled by simply expanding them when the lexer encounters them). – Ira Baxter Nov 03 '15 at 05:50
@Ira: Traditional shell aliases are basically textual substitutions - any following text on the invocation is parsed as part of the expansion. So while what you say may be true of aliases in other languages, or in more advanced shells, a textual replacement will almost always be correct. Further: the AST is not a parse tree, as I already said. You do whatever replacement creates the right alias semantics. – cliffordheath Nov 04 '15 at 03:11
"Almost always?" given the string " if (pqr abc" with pqr being an alias of "a>b)", how can you parse the string and then substitute the alias later? – Ira Baxter Nov 04 '15 at 03:51
You can't, and the shells don't. By almost always, I mean "in most contexts where an alias is legal". Your example is not legal in any shell I've used. I said "most" because I'm not sure; the OP should check. In any case, I'm done arguing about it, because this has nothing to do with the question as asked. – cliffordheath Nov 04 '15 at 03:57
It has everything to do with your answer. Unless the shell language insists that "aliases" can only occur where the grammar for the shell language allows only a single terminal or nonterminal, your solution simply doesn't work. I've only seen one language where that was true (because I designed the language that way on purpose). My example is essentially legal in every other macro language I've encountered. – Ira Baxter Nov 04 '15 at 05:10
A shell alias is not a macro, parameterless or otherwise, so your argument doesn't apply. Shell aliases are only detected and expanded where a command is valid. However, I may have gone too far in suggesting a simple AST substitution, so I generalised my answer above. Thanks for pointing out the possible misunderstanding. – cliffordheath Nov 04 '15 at 05:33
