Parse many terms not followed by symbol

Question

I am attempting to develop a lambda calculus interpreter, which supports definitions of terms. For that, I'm using the Megaparsec library in Haskell.

I have the following grammar:

term := var | lambda var . term | term term
def := var = term
instruction := :e t | def

My parser should return a list of instructions, where :e t will evaluate to whatever t is reduced, and def will allow me to have named definitions.

So far, I've managed to parse a very similar grammar, except I'm using semicolons to separate each instruction:

...
instruction := :e t; | def;

I'd like to remove the need for those semicolons, but I'm having no luck with that. My code is currently:

sc :: Parser ()
sc = L.space space1 (L.skipLineComment "--") (L.skipBlockComment "{-" "-}")

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: Text -> Parser Text
symbol = L.symbol sc

semicolon :: Parser Text
semicolon = symbol ";"

lowercaseWord :: Parser [Char]
lowercaseWord = some lowerChar

variable :: Parser Term
variable = lexeme lowercaseWord <&> Var

lamAbs :: Parser Term
lamAbs = do
  symbol "lambda"
  space
  v <- lexeme lowercaseWord
  space
  symbol "."
  space
  t <- term
  space
  return $ Abs v t

term :: Parser Term
term = foldl1 App <$> sepBy1 (parens term <|> try lamAbs <|> variable) sc
  where parens = between (symbol "(") (symbol ")")

definition :: Parser (Instruction Term)
definition = do
  n <- lexeme lowercaseWord
  space
  symbol "="
  space
  t <- term
  return $ IDef n t

commandEval :: Parser (Instruction Term)
commandEval = do
  symbol ":e"
  t <- term
  optional newline
  return $ IEval t

program :: Parser [Instruction Term]
program = sc *> sepEndBy (definition <|> commandEval) semicolon <* eof

parse :: ProgramParser (Instruction Term)
parse text = case runParser program "" text of
  Left err  -> Left $ errorBundlePretty err
  Right ast -> Right ast

When I attempt to remove the need for semicolons by changing the definition of program to:

program = sc *> some (definition <|> commandEval) <* eof

the parser will not be able to figure out that it needs to stop parsing a term once it finds something like var = ..., to continue parsing the next definition.

For example, the following input:

a = (lambda x . x)
    x

:e b a

b = lambda y . y

returns the following error:

6:3:
  |
6 | b = lambda y . y
  |   ^
unexpected '='
expecting ":e", "lambda", '(', end of input, lowercase letter, or newline

I tried to change the definition of term to make it explicit that it should not be followed by a lowercase word and a =. But that didn't work either:

term = foldl1 App <$> sepBy1 ((parens term <|> try lamAbs <|> variable) <* notFollowedBy newdef) sc
  where parens = between (symbol "(") (symbol ")")
        newdef = (lexeme lowercaseWord *> space) *> symbol "="

How can I make it so that I don't have the need for semicolons, and my parser automatically stops parsing a term application once it finds var = term?

Thank you in advance for your help.

I never used MegaParsec but just Parsec and I think both are very similar. I remember getting many unexpected errors, because I was parsing spaces when `lexeme` and `symbol` was already parsing them for me - your parsers just does that. I think you can get rid of semicolons after that. Also I strongly recommend writing tests as they ensure your parser won't regress. Happy coding. — Poselsky, Jun 17 '23 at 08:56

K. A. Buhr · Accepted Answer · 2023-06-17T17:58:04.367

As noted by @Poselsky, the point of using Text.Megaparsec.Char.Lexer is that you don't need (and shouldn't have) to deal with white space in your higher-level parsers. So, I'd start by deleting every use of space in lamAbs and definition, deleting the optional newline from commandEval and replacing sepBy1 xxx sc with some xxx in term. (In all these cases, these parsers perform no function. They are run on the input stream after the previous lexeme parser has already absorbed all trailing whitespace.)

I'd also move the lexeme call into lowercaseWord:

lowercaseWord :: Parser [Char]
lowercaseWord = lexeme (some lowerChar)

Then, except for your primitive parsers (now, symbol and lowercaseWord), the only place you should need to deal with whitespace is the sc at the start of your top-level parser program. Every other parser should be assumed to start at non-whitespace and absorb trailing whitespace, and those parsers can be combined in sequence to create another parser that follows the same rule, so no stray space or sc or newline calls are needed.

That will simplify your code, though it won't fix your problem.

Your attempted fix is the right idea. The problem is that it can't parse the last variable before a newdef because that variable is followed by a newdef.

You don't actually need to try to parse a whole definition. All you really need to do is make sure a variable isn't followed by a symbol "=", and you need to wrap it in a try block to backtrack on the name parsed by variable:

term :: Parser Term
term = foldl1 App <$> some (parens term <|> try lamAbs <|>
                            try (variable <* notFollowedBy (symbol "=")))
  where parens = between (symbol "(") (symbol ")")

Do be warned that there are still some problems with your parser. It accepts the following input, for example, which you might not expect:

x = lambdalicious. man

(Note that this has nothing to do with removing the space calls -- your original parser parsed this, too.)

The problem is that symbol "lambda" can match the prefix of a larger word, so "lambdalicious" is parsed as "lambda licious". There's a similar issue with symbol ":e", if you don't want ":eval" to parse as ":e val".

The usual solution is to define:

reserved :: Text -> Parser Text
reserved s = lexeme $ try (string s <* notFollowedBy lowerChar)

and then the following work as intended:

reserved "lambda"
reserved ":e"

Here's my full test program with all the above revisions:

{-# LANGUAGE OverloadedStrings #-}

module LambdaParser where

import Data.Void

import Data.Functor
import Data.Text (Text)
import Text.Megaparsec hiding (parse)
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void Text

data Term = Var String | Instruction Term | Abs String Term | App Term Term deriving (Show)
data Instruction t = IDef String t | IEval Term deriving (Show)

sc :: Parser ()
sc = L.space space1 (L.skipLineComment "--") (L.skipBlockComment "{-" "-}")

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: Text -> Parser Text
symbol = L.symbol sc

reserved :: Text -> Parser Text
reserved s = lexeme $ try (string s <* notFollowedBy lowerChar)

semicolon :: Parser Text
semicolon = symbol ";"

lowercaseWord :: Parser [Char]
lowercaseWord = lexeme (some lowerChar)

variable :: Parser Term
variable = lowercaseWord <&> Var

lamAbs :: Parser Term
lamAbs = do
  reserved "lambda"
  v <- lowercaseWord
  symbol "."
  t <- term
  return $ Abs v t

term :: Parser Term
term = foldl1 App <$> some (parens term <|> try lamAbs <|>
                            try (variable <* notFollowedBy (symbol "=")))
  where parens = between (symbol "(") (symbol ")")

definition :: Parser (Instruction Term)
definition = do
  n <- lowercaseWord
  symbol "="
  t <- term
  return $ IDef n t

commandEval :: Parser (Instruction Term)
commandEval = do
  reserved ":e"
  t <- term
  return $ IEval t

program :: Parser [Instruction Term]
program = sc *> some (definition <|> commandEval) <* eof

parse :: Text -> Either String [Instruction Term]
parse text = case runParser program "" text of
  Left err  -> Left $ errorBundlePretty err
  Right ast -> Right ast

main = do
  let ex1 = "a = (lambda x . x)\n   x\n\n:e b a\n\nb = lambda y . y"
  case parse ex1 of
    Left err -> putStrLn err
    Right pgm -> print pgm

Your suggestions worked perfectly. And also, thank you for letting me know about the problem with reserved keywords. — JPYamamoto, Jun 19 '23 at 17:19

Parse many terms not followed by symbol

1 Answers1