1

Given a parser

newtype Parser a = Parser { parse :: String -> [(a,String)] }

(>>=) :: Parser a -> (a -> Parser b) -> Parser b
p >>= f = Parser $ \s -> concat [ parse (f a) s' | (a, s') <- parse p s ]

return :: a -> Parser a
return a = Parser (\s -> [(a,s)])

item :: Parser Char
item = Parser $ \s -> case cs of
                         ""     -> []
                         (c:cs) -> [(c,cs)]

We can see that item consumes part of the input string given to it ("abc" -> [('a', "bc")]). Is there ever a case where a parser would produce additional string output or replace/modify it (e.g. Parser $ \s -> [((), 'a':s)])? I suspect that this might be the case with context-sensitive grammars but have trouble coming up with a sensible example.

Is there a reason why it would make sense to do this for a real-world problem?

References

Micha Wiedenmann
  • 19,979
  • 21
  • 92
  • 137

1 Answers1

3

Here are a couple of cases where it is convenient to inject tokens into the input stream. (How this is actually integrated into the parsing pipeline is another question.)

  1. Macro expansion, in the style of the C/C++ preprocessing phase. This is arguably not the best model for macro expansion; hygienic macros would more likely be expanded using a tree transformation, as with C++ template resolution. But the token-oriented preprocessor is not going away soon. Since it is not tightly coupled with the language syntax, the easiest implementation is to substitute the macro (and arguments if applicable) with the tokens from its expansion.

  2. Ecmascript-style automatic semi-colon insertion (ASI). The language syntax requires a semi-colon to be inserted into the token stream under certain precisely-defined circumstances, which are difficult (at least) to incorporate in a CFG. Since ASI is only possible if the next token in the input stream cannot be shifted (and done other conditions), it can certainly be integrated into the parser loop.

  3. Similarly, indentation-aware block syntax (as in Haskell and Python, for example) can certainly be implemented by replacing leading whitespace with an injected INDENT token or some number of injected DEDENTs. Since this substitution is dependent on parse context (it isn't done inside parentheses, for example), injection inside the parser could be a reasonable approach.

That's not an exhaustive list, but it might be at least indicative. Not all of those cases necessarily involve context-sensitivity (I believe ASI could, in theory, be handled with a context-free grammar although I have no intention of trying) and not all instances of context-sensitivity necessarily require token injection (the ambiguity in C between type and variable names only requires selecting the correct token).

rici
  • 234,347
  • 28
  • 237
  • 341