Megaparsec: skip space and non-alphanumeric

Question

I'm a beginner with Megaparsec and Haskell in general, and I'm trying to write a parser for the following grammar:

A word will always be one of:

A number composed of one or more ASCII digits (e.g. "0" or "1234") OR

A simple word composed of one or more ASCII letters (e.g. "a" or "they") OR

A contraction of two simple words joined by a single apostrophe (e.g. "it's" or "they're")

So far, I've got the following (this can probably be simplified):

data Word = Number String | SimpleWord String | Contraction String deriving (Show)

word :: Parser MyParser.Word
word = M.choice
  [ Number <$> number
  , Contraction <$> contraction
  , SimpleWord <$> simpleWord
  ]

number :: Parser String
number = M.some C.numberChar

simpleWord :: Parser String
simpleWord = M.some C.letterChar

contraction :: Parser String
contraction = do
  left <- simpleWord
  void $ C.char '\''
  right <- simpleWord
  return (left ++ "'" ++ right)

But I'm having trouble defining a parser that skips whitespace and anything else that is non-alphanumeric. For example, given the input 'abc', the parser should discard the apostrophes and just take the "simple word". The following doesn't compile:

filler :: Parser Char
filler = M.some (C.spaceChar  A.<|> not C.alphaNumChar)

spaceConsumer :: Parser ()
spaceConsumer = L.space filler A.empty A.empty

lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceConsumer

I think you might want to say what you want to skip. I have a hunch the `not` in this instance won't work the way you want it to. It's also worth making your parsers more specific to the `data` type you are parsing, for example `pSimpleWord = SimpleWord <$> ...`. — cstml, Jul 08 '22 at 09:18
@cstml Rewriting as `isSep x = C.isSpace x || (not . C.isAlphaNum) x` and `filler = void $ M.some (M.satisfy isSep)` compiles but doesn't skip the intended characters. — Abhijit Sarkar, Jul 08 '22 at 09:40

score 1 · Accepted Answer · answered Jul 10 '22 at 03:02

Here is the complete working code that I came up with.

-- The qualified names used below (e.g. WordCount.Word) imply this module name.
module WordCount where

import qualified Control.Applicative as A
import Control.Monad (void)
import qualified Data.Char as C
import Data.Void (Void)
import qualified Text.Megaparsec as M
import qualified Text.Megaparsec.Char as MC
import qualified Text.Megaparsec.Char.Lexer as L

type Parser =
  M.Parsec
    -- The type for custom error messages. We have none, so use `Void`.
    Void
    -- The input stream type. Let's use `String` for now.
    String

data Word = Number String | SimpleWord String | Contraction String deriving (Eq)

instance Show WordCount.Word where
  show (Number x) = x
  show (SimpleWord x) = x
  show (Contraction x) = x

-- Force the parser to consume the entire input.
-- `<*` sequences two actions and discards the value of the second.
words :: String -> Either String [String]
words input = case M.parse (M.some WordCount.word A.<* M.eof) "" input of
  -- err :: M.ParseErrorBundle String Void
  Left err -> Left (M.errorBundlePretty err)
  Right xs -> Right (map show xs)

-- Skip any leading filler, parse one word, then discard trailing filler.
word :: Parser WordCount.Word
word =
  M.skipManyTill filler $
    lexeme $
      M.choice
        -- <$> is infix for 'fmap'
        [ Number <$> number,
          -- `try` backtracks when the apostrophe or the right-hand word is missing
          Contraction <$> M.try contraction,
          SimpleWord <$> simpleWord
        ]

number :: Parser String
number = M.some MC.numberChar

simpleWord :: Parser String
simpleWord = M.some MC.letterChar

contraction :: Parser String
contraction = do
  left <- simpleWord
  void $ MC.char '\''
  right <- simpleWord
  return $ left ++ "'" ++ right

-- Define separator characters
isSep :: Char -> Bool
isSep x = C.isSpace x || (not . C.isAlphaNum) x

-- Fillers fill the space between tokens
filler :: Parser ()
filler = void $ M.some $ M.satisfy isSep

-- The 2nd and 3rd arguments of `L.space` skip line and block comments; we have none
spaceConsumer :: Parser ()
spaceConsumer = L.space filler A.empty A.empty

-- A parser that discards trailing space
lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceConsumer
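
To try the approach on a few inputs, here is a condensed sketch of the same idea. It is not the code above verbatim: it returns plain strings instead of the `Word` type, and the names `wordP` and `wordsIn` are hypothetical helpers introduced only for this example. `parseMaybe` succeeds only if the whole input is consumed, so it stands in for the explicit `eof`.

```haskell
import Control.Applicative (empty)
import Control.Monad (void)
import Data.Char (isAlphaNum)
import Data.Void (Void)
import Text.Megaparsec (Parsec, choice, parseMaybe, satisfy, skipManyTill, some, try)
import Text.Megaparsec.Char (char, letterChar, numberChar)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Anything that is not alphanumeric (this includes whitespace) counts as filler.
filler :: Parser ()
filler = void (some (satisfy (not . isAlphaNum)))

-- Run a parser, then discard any filler that follows it.
lexeme :: Parser a -> Parser a
lexeme = L.lexeme (L.space filler empty empty)

simpleWord :: Parser String
simpleWord = some letterChar

contraction :: Parser String
contraction = do
  left <- simpleWord
  _ <- char '\''
  right <- simpleWord
  pure (left ++ "'" ++ right)

-- Hypothetical name: skip leading filler, then parse one word.
-- `try` lets a failed contraction fall back to a simple word.
wordP :: Parser String
wordP = skipManyTill filler . lexeme $
  choice [some numberChar, try contraction, simpleWord]

-- Hypothetical helper: Nothing on a parse error, otherwise the list of words.
wordsIn :: String -> Maybe [String]
wordsIn = parseMaybe (some wordP)
```

With these definitions, `wordsIn "'abc'"` gives `Just ["abc"]`, `wordsIn "it's 42, they're"` gives `Just ["it's","42","they're"]`, and `wordsIn ""` gives `Nothing` because `some` needs at least one word.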

Paul Johnson · Answer 2 · 2022-07-08T12:56:06.000

First, you probably want to use some1 for number and simple words, otherwise "" would be a number.
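
For reference, a quick way to check how `some` and `many` treat empty input. This is a minimal sketch; the names `atLeastOne` and `zeroOrMore` are made up for the example. In Megaparsec, `some` already means "one or more" and `many` means "zero or more".

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, many, parseMaybe, some)
import Text.Megaparsec.Char (numberChar)

type Parser = Parsec Void String

-- One or more digits: fails on empty input.
atLeastOne :: Parser String
atLeastOne = some numberChar

-- Zero or more digits: succeeds on empty input with an empty result.
zeroOrMore :: Parser String
zeroOrMore = many numberChar
```

Here `parseMaybe atLeastOne ""` is `Nothing`, while `parseMaybe zeroOrMore ""` is `Just ""`.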

Your filler parser is good. That should use some because you want to allow for e.g. "they1234" to parse as SimpleWord "they" and Number "1234".

What you need to say for the overall parser is that your text consists of zero or more words separated by filler, with optional filler before and after. Fortunately megaparsec re-exports lots of useful stuff from Control.Monad.Combinators for doing this.

So we can use sepBy for the words separated by filler:

document :: Parser [Word]
document = do
   _ <- filler   -- Throw away any filler at the start.
   result <- word `sepBy` filler
   _ <- filler   -- Throw away any filler at the end.
   return result

We don't need optional for the start and end filler because filler can be zero length.
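
A self-contained sketch of this separator-based layout follows. It makes three assumptions that differ from the snippet above: `filler` is defined with `some` as in the question's comments, so it cannot match zero characters and the leading filler is wrapped in `optional`; `sepEndBy` is used instead of `sepBy` so that trailing filler is absorbed; and `contraction` is wrapped in `try` so it can fall back to a simple word.

```haskell
import Prelude hiding (Word)
import Control.Applicative (optional)
import Control.Monad (void)
import Data.Char (isAlphaNum)
import Data.Void (Void)
import Text.Megaparsec (Parsec, choice, parseMaybe, satisfy, sepEndBy, some, try)
import Text.Megaparsec.Char (char, letterChar, numberChar)

type Parser = Parsec Void String

data Word = Number String | SimpleWord String | Contraction String
  deriving (Eq, Show)

-- One or more non-alphanumeric characters (whitespace included).
filler :: Parser ()
filler = void (some (satisfy (not . isAlphaNum)))

simpleWord :: Parser String
simpleWord = some letterChar

contraction :: Parser String
contraction = do
  left <- simpleWord
  _ <- char '\''
  right <- simpleWord
  pure (left ++ "'" ++ right)

word :: Parser Word
word = choice
  [ Number <$> some numberChar
  , Contraction <$> try contraction  -- backtrack if no apostrophe follows
  , SimpleWord <$> simpleWord
  ]

-- Optional leading filler, then words separated (and optionally ended) by filler.
document :: Parser [Word]
document = optional filler *> (word `sepEndBy` filler)
```

Run through `parseMaybe`, which requires the whole input to be consumed, `parseMaybe document "abcd 123"` gives `Just [SimpleWord "abcd",Number "123"]` and `parseMaybe document "'abc'"` gives `Just [SimpleWord "abc"]`.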

Finally, a style point: in a real parser you would want to make the Word type a bit more sophisticated. Something like:

data SimpleWord = Number String | SimpleWord String

data Word = Word SimpleWord | Contraction SimpleWord SimpleWord

That way whatever bit of code deals with Contraction downstream doesn't have to find the apostrophe all over again or deal with the "impossible" case where there isn't one. Once you've found the structure information in your input, don't throw it away. But that's a side issue for this exercise.
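
As a rough illustration of that point, parsers for the richer types might look like the sketch below. The parser names (`letters`, `number`, `contraction`, `word`) are chosen for this example, and both halves of a contraction are assumed to be letter-only words, as in the question's grammar.

```haskell
import Prelude hiding (Word)
import Data.Void (Void)
import Text.Megaparsec (Parsec, choice, parseMaybe, some, try)
import Text.Megaparsec.Char (char, letterChar, numberChar)

type Parser = Parsec Void String

data SimpleWord = Number String | SimpleWord String
  deriving (Eq, Show)

data Word = Word SimpleWord | Contraction SimpleWord SimpleWord
  deriving (Eq, Show)

letters :: Parser SimpleWord
letters = SimpleWord <$> some letterChar

number :: Parser SimpleWord
number = Number <$> some numberChar

-- Keep both halves; the apostrophe is consumed but not stored.
contraction :: Parser Word
contraction = Contraction <$> letters <* char '\'' <*> letters

word :: Parser Word
word = choice [Word <$> number, try contraction, Word <$> letters]
```

For example, `parseMaybe word "it's"` yields `Just (Contraction (SimpleWord "it") (SimpleWord "s"))`, so downstream code can pattern match on the two halves directly instead of searching the string for the apostrophe.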

> use `some1` - I couldn't find `some1` in their docs, and `some` means at least one, so, I'm not sure why you think `""` would be accepted. Then beginning with `_ <- filler` doesn't work at all, input "abcd 123" fails with the error "unexpected 'a'". Removing that line also doesn't work, "abcd 123" fails with "unexpected space expecting ''' or letter". Have you actually tried to run your own suggestions? — Abhijit Sarkar, Jul 08 '22 at 17:40
Ahh, you also need to use [try](https://hackage.haskell.org/package/megaparsec-9.2.1/docs/Text-Megaparsec.html#v:try) before each option in `word`. — Paul Johnson, Jul 09 '22 at 18:30
