How to disallow blanks between tokens?

Question

In bash, there must be no spaces around = in an assignment.

x=10

bash's yylex() just returns the whole of x=10 as a single ASSIGNMENT_WORD token, which is then processed afterwards.

http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n723

But is it better to handle the assignment in the parser instead of the lexer (as most of the examples of assignment that I see handle it in the parser)?

How to design the grammar to disallow spaces around =? Thanks.

It really depends on what you’re trying to do. Bash disallows spaces in assignments but allows them elsewhere, so it _must_ have a way to differentiate between `x = 10` and `x=10`, something that can only be done at the lexing step. – bfontaine Feb 14 '19 at 17:19

score 2 · Answer 1 · answered Feb 14 '19 at 17:32

But is it better to handle the assignment in the parser instead of the lexer (as most of the examples of assignment that I see handle it in the parser)?

If whitespace generally serves to separate lexical tokens, then it is probably easier to avoid making an exception for variable assignments. Note that in the particular case of the shell command language, the specifications explicitly describe parsing in terms of lexical tokenization having been done that way, so in that case it is particularly appealing to write your code so that it corresponds directly to the specifications.

If you do that, then it is very helpful to also let the lexer recognize assignments as their own token type, as that information falls out of the lexical analysis pretty cheaply, whereas it would be messier and more expensive for the parser stage to (re-)check each token value to recognize which are assignments.
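To make that concrete, here is a minimal sketch in Python (chosen only for brevity). It is not bash's lexer: the `tokenize` function and the regular expression are made up for illustration, the token names merely echo the WORD / ASSIGNMENT_WORD distinction from the question, and quoting and expansions are ignored entirely.

```python
import re

# Hypothetical rule: a word is an assignment if it starts with NAME=.
# Quoting, expansions and operators are deliberately ignored here.
ASSIGNMENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*=")


def tokenize(line):
    """Split on blanks, then classify each word by its shape alone."""
    tokens = []
    for word in line.split():
        kind = "ASSIGNMENT_WORD" if ASSIGNMENT_RE.match(word) else "WORD"
        tokens.append((kind, word))
    return tokens


print(tokenize("x=10"))
print(tokenize("x = 10"))
```

Because blanks have already split `x = 10` into three separate words before classification happens, none of them can be mistaken for an assignment, and the grammar never has to mention whitespace.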

How to design the grammar to disallow spaces around =?

If the whitespace rules are to be applied at the grammar level, as opposed to the tokenization level, then the tokenizer needs to emit explicit whitespace tokens so that the parser sees where the whitespace is. You can then write grammar rules that accommodate whitespace where it is allowed, and not where it isn't. But take it from someone who has done that (for a different language): it's ugly and nasty, and you should make every effort to avoid it.
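As a rough illustration of that alternative, the Python sketch below has the tokenizer emit explicit BLANK tokens and leaves it to a hand-written "grammar rule" to decide where they are tolerated. All names here (`tokenize_with_blanks`, `parse_simple_command`, the token kinds) are hypothetical, and the language is a toy, not the shell grammar.

```python
import re

# Every character is a blank, an '=' or part of a word, so no input is skipped.
TOKEN_RE = re.compile(r"(?P<BLANK>[ \t]+)|(?P<EQ>=)|(?P<WORD>[^ \t=]+)")


def tokenize_with_blanks(line):
    """Return (kind, text) pairs, keeping whitespace as BLANK tokens."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(line)]


def parse_simple_command(tokens):
    """Toy rule set: WORD EQ WORD with nothing in between is an assignment;
    anything else is a command whose blanks must be skipped explicitly."""
    kinds = [kind for kind, _ in tokens]
    if kinds == ["WORD", "EQ", "WORD"]:
        return ("assign", tokens[0][1], tokens[2][1])
    words = [text for kind, text in tokens if kind != "BLANK"]
    return ("command", words)


print(parse_simple_command(tokenize_with_blanks("x=10")))
print(parse_simple_command(tokenize_with_blanks("x = 10")))
```

Even in this toy, the command rule has to filter out BLANK tokens itself, and a leading blank before `x=10` already defeats the assignment rule. In a real grammar every production would need the same kind of bookkeeping.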

John Bollinger


But spaces are allowed in assignments between `((` and `))` (for math), so the tokenization will differ between math mode and non-math mode. Currently, bash uses `ARITH_CMD` to handle everything between `((` and `))` http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n349. Is that the reason why ARITH_CMD is captured at the lexical analysis level instead of the grammar level? – user1424739 Feb 14 '19 at 22:14
Also, why is it ugly and nasty? Parsing it as a WORD first and then post-processing it does not seem good either. See, for example, http://git.savannah.gnu.org/cgit/bash.git/tree/subst.c#n941. Given that, it seems to me that tokenizing blanks and processing them at the grammar level should be a viable solution, and that it would be more flexible for adding features to the language later. Is it? – user1424739 Feb 15 '19 at 02:45
@user1424739, not being among Bash's developers, I cannot tell you why they made the design decisions they did. However, the fact that arithmetic expansions have easy-to-recognize boundaries and their own special sublanguage does make them an attractive target for being farmed out to a separate module for parsing. – John Bollinger Feb 15 '19 at 05:56
As for emitting whitespace tokens to the parser, it is ugly and nasty because then your entire grammar needs to account for *all* the whitespace. It requires a much more complex grammar. Avoiding all that is one of the more useful things we usually gain from separating lexical analysis from grammar. – John Bollinger Feb 15 '19 at 06:00
How do I make a separate module for a different subset of a language? In bison, everything must be in the same .y file. How do I switch between different lexers and parsers for different subsets of the language? Is there a minimal working example that demonstrates this? – user1424739 Feb 15 '19 at 10:59
@user1424739 you should ask this as a different question. – bfontaine Feb 18 '19 at 17:37
