Mixing Parser Char (lexer?) vs. Parser String

Question

I've written several compilers and am familiar with lexers, regexes/NFAs/DFAs, parsers, and semantic rules in flex/bison, JavaCC, JavaCup, antlr4, and so on.

Is there some sort of magical monadic operator that seamlessly grows/combines a token from a mix of Parser Char (i.e. Text.Megaparsec.Char) and Parser String parsers?

Is there a way, or are there best practices, to cleanly separate lexing tokens from nonterminal expectations?

`charToStr = fmap (: [])` seems like a nasty hack to upgrade a `Parser Char` to `Parser String`. — , Aug 12 '18 at 22:53
A nonterminal is a rule that may contain nonterminals or terminals (tokens). In `parserTest foo "some interesting input"`, `foo` is both a nonterminal (parser) and an expectation placed on the input. "Expectation" may be read as a synonym for "parse rule" in this Q. — , Aug 12 '18 at 23:49

score 3 · Accepted Answer · answered Aug 12 '18 at 23:42

Typically, one uses applicative operations to directly combine Parser Char and Parser Strings, rather than "upgrading" the former. For example, a parser for alphanumeric identifiers that must start with a letter would probably look like:

ident :: Parser String
-- A letter, followed by zero or more alphanumeric characters.
ident = (:) <$> letterChar <*> many alphaNumChar
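
For completeness, here is a self-contained sketch of the identifier parser. The `Parser` alias (`Parsec Void String`) and the import list are assumptions about the setup, which the answer does not show.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, many, parseMaybe)
import Text.Megaparsec.Char (alphaNumChar, letterChar)

-- Assumed parser type: no custom error component, String input.
type Parser = Parsec Void String

-- letterChar yields a Char, many alphaNumChar yields a String,
-- and (:) conses the first onto the second.
ident :: Parser String
ident = (:) <$> letterChar <*> many alphaNumChar
```

In GHCi, `parseMaybe ident "x1"` should give `Just "x1"`, while `parseMaybe ident "1x"` gives `Nothing` because the first character is not a letter.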

If you were doing something more complicated, such as parsing dollar amounts with optional cents, you might write:

-- Uses (<**>) from Control.Applicative and replicateM from Control.Monad.
dollars :: Parser String
dollars = (:) <$> char '$' <*> some digitChar
          <**> pure (++)
          <*> option "" ((:) <$> char '.' <*> replicateM 2 digitChar)
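
The `<**> pure (++)` step can look odd at first: it applies `(++)` to the string parsed so far, so that the next `<*>` supplies the second argument. An equivalent and arguably plainer formulation might look like the sketch below. It assumes `Parser` is `Parsec Void String`; the names `whole`, `cents`, and `dollarsAlt` are made up for illustration.

```haskell
import Control.Monad (replicateM)
import Data.Void (Void)
import Text.Megaparsec (Parsec, option, parseMaybe, some)
import Text.Megaparsec.Char (char, digitChar)

-- Assumed parser type: no custom error component, String input.
type Parser = Parsec Void String

-- Whole-dollar part: '$' followed by one or more digits.
whole :: Parser String
whole = (:) <$> char '$' <*> some digitChar

-- Optional cents: '.' followed by exactly two digits, or "" if absent.
cents :: Parser String
cents = option "" ((:) <$> char '.' <*> replicateM 2 digitChar)

-- Concatenate the two parts, which is what `<**> pure (++)` achieves inline.
dollarsAlt :: Parser String
dollarsAlt = (++) <$> whole <*> cents
```

With this sketch, `parseMaybe dollarsAlt "$12.34"` should give `Just "$12.34"` and `parseMaybe dollarsAlt "$12"` should give `Just "$12"`.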

If you often find yourself building a Parser String out of a complicated sequence of Parser Char and Parser String parsers, you could define a few helper operators like the ones below. If the variety of operators is annoying, you could instead define just (<++>) and a short form of charToStr, such as c :: Parser Char -> Parser String.

(<.+>) :: Parser Char -> Parser String -> Parser String
p <.+> q = (:) <$> p <*> q
infixr 5 <.+>

(<++>) :: Parser String -> Parser String -> Parser String
p <++> q = (++) <$> p <*> q
infixr 5 <++>

(<..>) :: Parser Char -> Parser Char -> Parser String
p <..> q = p <.+> fmap (:[]) q
infixr 5 <..>

so you can write something like:

dollars' :: Parser String
dollars' = char '$' <.+> some digitChar 
           <++> option "" (char '.' <.+> digitChar <..> digitChar)
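
The single-operator variant mentioned earlier (just `(<++>)` plus a short form of `charToStr`) could be sketched as follows. It assumes `Parser` is `Parsec Void String`; `c` is the short-form name suggested in this answer, and `dollars''` is a made-up name for the example.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, option, parseMaybe, some)
import Text.Megaparsec.Char (char, digitChar)

-- Assumed parser type: no custom error component, String input.
type Parser = Parsec Void String

-- Run two string parsers in sequence and concatenate their results.
(<++>) :: Parser String -> Parser String -> Parser String
p <++> q = (++) <$> p <*> q
infixr 5 <++>

-- Short form of charToStr: turn a one-character result into a String.
c :: Parser Char -> Parser String
c = fmap (: [])

dollars'' :: Parser String
dollars'' = c (char '$') <++> some digitChar
            <++> option "" (c (char '.') <++> c digitChar <++> c digitChar)
```

The trade-off is one operator to remember at the cost of wrapping every character parser in `c`. If your Megaparsec version provides a `Semigroup` instance for parsers, `<>` may be able to stand in for `<++>`.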

As @leftaroundabout says, there's nothing hackish about fmap (:[]). If you think it looks clearer, write fmap (\c -> [c]) instead.

Thanks for answering. Custom operators seem the best way to get a cleaner EDSL. — , Aug 12 '18 at 23:51

score 2 · Answer 2 · answered Aug 12 '18 at 23:31

There's nothing nasty or hackish about fmap (: []) (or fmap pure or pure <$>) – it's the natural thing to do, performing a conversion that's concise, safe, expressive and transparent all at the same time.

An alternative that I wouldn't really recommend, but for some situations it might express the intent best: sequence [charParser]. This makes it clear that you're executing “all” of the parsers in a list of character-parsers, and gathering the result“s” as a list of character“s”.
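
A small sketch comparing the spellings mentioned in this answer, assuming `Parser` is `Parsec Void String`. The names `viaCons`, `viaPure`, and `viaSequence` are invented for the comparison.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, parseMaybe)
import Text.Megaparsec.Char (letterChar)

-- Assumed parser type: no custom error component, String input.
type Parser = Parsec Void String

-- Explicit singleton-list section.
viaCons :: Parser String
viaCons = fmap (: []) letterChar

-- `pure` at the list type also builds a one-element list.
viaPure :: Parser String
viaPure = pure <$> letterChar

-- Run every parser in the list (here just one) and collect the results.
viaSequence :: Parser String
viaSequence = sequence [letterChar]
```

All three have the same type and should behave identically: on the input `"a"`, each yields `Just "a"` under `parseMaybe`.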

leftaroundabout

Primary objective: a semantically readable, clean EDSL style in Haskell without clunky hacks all over. `sequence` LGTM. Been using combinator ops `<>`, `<*`, `*>`, `<|>`. I'm sure some of these type impedance mismatches will go away with parse tree or AST parser return types. Been doing `(:) something <$> whatever` here and there, but it seems to make the intention less clear. – Aug 12 '18 at 23:40
