Pushing back tokens in Happy and Alex

Question

I'm parsing a language which has both < and <<. In my Alex definition I've got something that contains something like

tokens :-

"<"             { token Lt }
"<<"            { token (BinOp Shl) }

so whenever I encounter <<, that gets tokenized as a left shift and not as to less-than's. This is generally a good thing, since I end up throwing out whitespace after tokenization and want to differentiate between 1 < < 2 and 1 << 2. However, there are other times I wish << had been read as two <. For example, I have things like

<<A>::B>

which I want read like

< < A > :: B >

Obviously I can try to adjust my Happy parser rules to accommodate for the extra cases, but that scales badly. In other imperative parser generators, I might try to do something like push back "part" of the token (something like push_back("<") when I encountered << but I only needed <).

Has anyone else had such a problem and, if so, how did you deal with it? Are there ways of "pushing back" tokens in Happy? Should I instead try to keep a whitespace token around (I'm actually leaning towards the last alternative - although being a huge headache, it would let me deal with << by just making sure there is no whitespace between the two <).

score 2 · Answer 1 · answered Nov 10 '16 at 23:51

I don’t know how to express this in Happy, but you don’t need a separate “whitespace” token. You can parse < or > as a distinct “angle bracket” token when immediately followed by an operator symbol in the input, with no intervening whitespace.

Then, when you want to parse an operator, you join a sequence of angles and operators into a single token. When you want to treat them as brackets, you just deal with them separately as usual.

So a << b would be tokenised as:

identifier "a"
left angle      -- joined with following operator
operator "<"
identifier "b"

When parsing an operator, you concatenate angle tokens with the following operator token, producing a single operator "<<" token.

<<A>::B> would be tokenised as:

left angle
operator "<"    -- accepted as bracket
identifier "A"
right angle
operator "::"
identifier "B"
operator ">"    -- accepted as bracket

When parsing angle-bracketed terms, you accept both angle tokens and </> operators.

This relies on your grammar not being ambiguous wrt. whether you should parse an operator name or a bracketed thing.

Smart idea! Although I haven't yet managed to find a way to get two tokens by Alex, I can see how this will work. Thanks! — Alec, Nov 13 '16 at 11:28
@Alec: Yeah, I’ve only ever used this technique in Parsec, where you have lookahead (e.g. `try (Operator <$> char '<' <* notFollowedBy symbol) <|> LeftAngle <$> char '<'`), but that doesn’t look possible to express directly in Alex. You could probably encode it without lookahead somehow. — Jon Purdy, Nov 13 '16 at 19:35

score 0 · Accepted Answer · answered Mar 09 '17 at 03:47

While I initially went with @Jon's answer, I ended up running into a variety of precedence related issues (think precedence around expr < expr vs expr << expr) which caused me a lot of headaches. I recently (successfully) back to lexing << as one token. The solution was twofold:

I bit the bullet and added extra rules for << (where previously I only had rules for <). For the example in the question (<<A>::B>) my rule went from something like
```
ty_qual_path
  : '<' ty_sum '>' '::' ident
```
to
```
ty_qual_path
  : '<' ty_sum '>' '::' ident
  | '<<' ty_sum '>' '::' ident '>' '::' ident
```
(The actual rule was actually a bit more involved, but that is not for this answer).
I found a clever way to deal with token that started with > (these would cause problems around things like vector<i32,vector<i32>> where the last >> was a token): use a threaded lexer (section 2.5.2), exploit the {%% ... } RHS of rules which lets you reconsider the lookahead token, and add a pushToken facility to my parser monad (this turned out to be quite simple - here is exactly what I did). I then added a dummy rule - something like
```
gt :: { () }
  : {- empty -}   {%% \tok ->
      case tok of
        Tok ">>"  -> pushToken (Tok ">")  *> pushToken (Tok ">")
        Tok ">="  -> pushToken (Tok "=")  *> pushToken (Tok ">")
        Tok ">>=" -> pushToken (Tok ">=") *> pushToken (Tok ">")
        _         -> pushToken tok
    }
```
And every time in some other rule I expected a > but there could also be any other token starting with >, I would precede the > token with gt. This has the effect of looking ahead to the next token which may could start with > without being >, and try to convert that token into one > token and another token for the "rest" of the initial token.

Pushing back tokens in Happy and Alex

2 Answers2