8

I'm trying to learn Parsec by implementing a small regular expression parser. In BNF, my grammar looks something like:

EXP  : EXP *
     | LIT EXP
     | LIT

I've tried to implement this in Haskell as:

expr = try star
       <|> try litE
       <|> lit

litE  = do c <- noneOf "*"
           rest <- expr
           return (c : rest)

lit   = do c <- noneOf "*"
           return [c]

star = do content <- expr
          char '*'
          return (content ++ "*")

There are some infinite loops here though (e.g. expr -> star -> expr without consuming any tokens) which makes the parser loop forever. I'm not really sure how to fix it though, because the very nature of star is that it consumes its mandatory token at the end.

Any thoughts?

Xodarap
  • 11,581
  • 11
  • 56
  • 94

2 Answers2

12

You should use Parsec.Expr.buildExprParser; it is ideal for this purpose. You simply describe your operators, their precedence and associativity, and how to parse an atom, and the combinator builds the parser for you!

You probably also want to add the ability to group terms with parens so that you can apply * to more than just a single literal.

Here's my attempt (I threw in |, +, and ? for good measure):

import Control.Applicative
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr

data Term = Literal Char
          | Sequence [Term]
          | Repeat (Int, Maybe Int) Term
          | Choice [Term]
  deriving ( Show )

term :: Parser Term
term = buildExpressionParser ops atom where

  ops = [ [ Postfix (Repeat (0, Nothing) <$ char '*')
          , Postfix (Repeat (1, Nothing) <$ char '+')
          , Postfix (Repeat (0, Just 1)  <$ char '?')
          ]
        , [ Infix (return sequence) AssocRight
          ]
        , [ Infix (choice <$ char '|') AssocRight
          ]
        ]

  atom = msum [ Literal <$> lit
              , parens term
              ]

  lit = noneOf "*+?|()"
  sequence a b = Sequence $ (seqTerms a) ++ (seqTerms b)
  choice a b = Choice $ (choiceTerms a) ++ (choiceTerms b)
  parens = between (char '(') (char ')')

  seqTerms (Sequence ts) = ts
  seqTerms t = [t]

  choiceTerms (Choice ts) = ts
  choiceTerms t = [t]

main = parseTest term "he(llo)*|wor+ld?"
pat
  • 12,587
  • 1
  • 23
  • 52
  • 2
    Wow. That's so easy it almost feels like cheating. – Xodarap Jan 27 '12 at 17:38
  • 1
    It would have been even easier if `Sequence, Choice :: Term -> Term -> Term` instead of `[Term] -> Term`, but I guess it demonstrates how to deal with an AST that doesn't exactly match the parse tree... – pat Jan 27 '12 at 17:48
6

Your grammar is left-recursive, which doesn’t play nice with try, as Parsec will repeatedly backtrack. There are a few ways around this. Probably the simplest is just making the * optional in another rule:

lit :: Parser (Char, Maybe Char)
lit = do
  c <- noneOf "*"
  s <- optionMaybe $ char '*'
  return (c, s)

Of course, you’ll probably end up wrapping things in a data type anyway, and there are a lot of ways to go about it. Here’s one, off the top of my head:

import Control.Applicative ((<$>))

data Term = Literal Char
          | Sequence [Term]
          | Star Term

expr :: Parser Term
expr = Sequence <$> many term

term :: Parser Term
term = do
  c <- lit
  s <- optionMaybe $ char '*' -- Easily extended for +, ?, etc.
  return $ if isNothing s
    then Literal c
    else Star $ Literal c

Maybe a more experienced Haskeller will come along with a better solution.

Jon Purdy
  • 53,300
  • 8
  • 96
  • 166
  • 1
    I'm sure you're right, but I don't understand why. It seems like the new `lit` function adds a production `EXP -> LIT *` but still keeps the left-recursive rule `EXP -> EXP *`... right? Or are you thinking I replace the `star` function with the `lit` one? – Xodarap Jan 26 '12 at 21:05
  • 1
    Well, a Kleene star only applies to the term immediately to its left, which in your code can be either a literal or a starred term, which may or may not be what you want (e.g., `a**` is redundant). Left-factoring *removes* the left recursion: `EXP -> EXP *` becomes `EXP -> LIT REST?` where `REST -> *`. You substitute one level of recursion manually and make the “tail” of the expression explicit. – Jon Purdy Jan 27 '12 at 03:34
  • Yeah, once I add in parens it won't work that way, but I see your point. I guess I'll just try to remove left recursion through the standard way and hope I can maintain my associativity. Thanks for pointing out that this was the problem. – Xodarap Jan 27 '12 at 03:53