How do I turn this regex into a Megaparsec parser without making a mess?

Question

Consider this regex:

^foo/[^=]+/baz=(.*),[^,]*$

If I run it on foo/bar/baz=one,two, it matches and the subgroup captures one. If I run it on foo/bar/baz/bar/baz=three,four,five, it matches and the subgroup captures three,four.

I know how to turn this into a regex-applicative parser or a ReadP parser:

import Text.Regex.Applicative
match (string "foo/" *> some (psym (/= '=')) *> string "/baz=" *> many anySym <* sym ',' <* many (psym (/= ','))) <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Just "one",Just "three,four"]

import Text.ParserCombinators.ReadP
readP_to_S (string "foo/" *> many1 (satisfy (/= '=')) *> string "/baz=" *> many get <* char ',' <* many (satisfy (/= ',')) <* eof) <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [[("one","")],[("three,four","")]]

And both of those work just the way I want them to. But when I try to transliterate that directly into Megaparsec, it goes badly:

import Text.Megaparsec
parse (chunk "foo/" *> some (anySingleBut '=') *> chunk "/baz=" *> many anySingle <* single ',' <* many (anySingleBut ',') <* eof) "" <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Left (ParseErrorBundle {bundleErrors = TrivialError 11 (Just (Tokens ('=' :| "one,"))) (fromList [Tokens ('/' :| "baz=")]) :| [], bundlePosState = PosState {pstateInput = "foo/bar/baz=one,two", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}}),Left (ParseErrorBundle {bundleErrors = TrivialError 19 (Just (Tokens ('=' :| "thre"))) (fromList [Tokens ('/' :| "baz=")]) :| [], bundlePosState = PosState {pstateInput = "foo/bar/baz/bar/baz=three,four,five", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}})]

I know this stems from Megaparsec not backtracking by default. I tried to fix this by just sticking try in a bunch of different places, but I couldn't get that to work. Eventually, I got this monstrosity with notFollowedBy to work:

import Text.Megaparsec
parse (chunk "foo/" *> some (noneOf "=/" <|> try (single '/' <* notFollowedBy (chunk "baz="))) *> chunk "/baz=" *> many (try (anySingle <* notFollowedBy (many (anySingleBut ',') <* eof))) <* single ',' <* many (anySingleBut ',') <* eof) "" <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Right "one",Right "three,four"]

But that looks like a mess! In particular, I don't like that I effectively had to specify much of the pattern twice. And technically, wouldn't that be equivalent to the regex ^foo/(?:[^=/]|/(?!baz=))+/baz=((?:.(?![^,]*$))*),[^,]*$, rather than my initial regex? There's got to be a better way to write that parser. How do I do it?

Edit: I also tried it this way, which looked like it worked at first but doesn't (it incorrectly accepts foo//baz=,):

import Text.Megaparsec
parse (chunk "foo/" *> (some . try $ many (noneOf "=/") *> single '/') *> chunk "baz=" *> ((++) <$> many (anySingleBut ',') <*> (concat <$> manyTill ((:) <$> single ',' <*> many (anySingleBut ',')) (try $ single ',' *> many (anySingleBut ',') *> eof)))) "" <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Right "one",Right "three,four"]

It seems just as messy, though, and manyTill means it doesn't really map onto any regex anymore.

Daniel Wagner · Answer 1 · 2020-01-09T05:04:34.860

Without reading carefully, I guess the bit that's giving you trouble is this part:

(.*),[^,]*

If so, then consider using

sepBy (many (noneOf ",")) (string ",")

which will parse a list of comma-separated things. Then re-insert commas between all but the last element of that list in pure code afterwards (e.g. with a well-placed fmap).
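A minimal sketch of how that could look, assuming a String input stream and no custom error type. The Parser alias and the name captureBeforeLastComma are made up for illustration, and the explicit check for at least two fields is an addition here to mirror the regex's mandatory comma; it uses a case expression instead of a bare fmap so that input without a comma can be rejected.

```haskell
import Data.List (intercalate)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (string)

-- Assumed parser type: String input, no custom error component.
type Parser = Parsec Void String

-- Rough translation of (.*),[^,]*$ : read every comma-separated field up
-- to end of input, then glue all but the last field back together.
captureBeforeLastComma :: Parser String
captureBeforeLastComma = do
  fields <- sepBy (many (noneOf ",")) (string ",") <* eof
  case fields of
    -- Two or more fields means at least one comma was present.
    (_ : _ : _) -> pure (intercalate "," (init fields))
    _           -> fail "expected at least one comma"
```

With these definitions, parseMaybe captureBeforeLastComma "one,two" gives Just "one", and the same call on "three,four,five" gives Just "three,four". No try is needed because the field parser stops at every comma on its own.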

From the comments, it seems you are also having some trouble with this part:

/[^=]+/baz=

You could consider something like this as a translation for that:

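-- A "/" followed by either the literal "baz=" or one more segment and another slashPath.
slashPath = string "/" <++> path
path = string "baz=" <|> (many (noneOf "=/") <++> slashPath)
-- Run two parsers in sequence and concatenate their results (liftA2 is from Control.Applicative).
(<++>) = liftA2 (++)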
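Spelled out with imports and type signatures, those definitions might look like the sketch below. The Parser alias and the afterPrefix helper are hypothetical names added only so the prefix can be tried in isolation; afterPrefix returns whatever input is left after the baz= part.

```haskell
import Control.Applicative (liftA2)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (string)

-- Assumed parser type: String input, no custom error component.
type Parser = Parsec Void String

-- Run two parsers in sequence and concatenate what they matched.
(<++>) :: Parser String -> Parser String -> Parser String
(<++>) = liftA2 (++)

-- A "/" followed by the rest of the path.
slashPath :: Parser String
slashPath = string "/" <++> path

-- Either the terminating "baz=", or one more segment and another slashPath.
-- No try is needed: a failing string consumes no input in Megaparsec.
path :: Parser String
path = string "baz=" <|> (many (noneOf "=/") <++> slashPath)

-- Hypothetical helper: skip the prefix and return the remaining input.
afterPrefix :: Parser String
afterPrefix = string "foo" *> slashPath *> takeRest
```

With these definitions, parseMaybe afterPrefix "foo/bar/baz/bar/baz=three,four,five" gives Just "three,four,five". One caveat: this version also accepts foo/baz=x, which the original regex rejects because [^=]+ needs at least one character, so an extra check would be needed if that case matters.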

edited Jan 09 '20 at 05:04

answered Jan 08 '20 at 21:39

Daniel Wagner


That looks like it might work for the comma part of the problem, but not for the equals sign part of the problem. – Joseph Sible-Reinstate Monica Jan 08 '20 at 21:41
@JosephSible-ReinstateMonica `Text.Megaparsec` exports `sepBy` (by re-exporting `Control.Monad.Combinators`). What makes the equals-sign stuff hard? (What goes wrong with just using `many` and a parser that matches any character but `=`?) – Daniel Wagner Jan 08 '20 at 21:43
Ah, indeed. But the other part of my comment still stands: it doesn't look helpful for the equals sign part, since the bit in the second argument of `sepBy` gets thrown away, and I need some of what's there. – Joseph Sible-Reinstate Monica Jan 08 '20 at 21:44
@JosephSible-ReinstateMonica Does `some {- translation of [^=] -}` work as a translation of `[^=]+`? If not, why not? – Daniel Wagner Jan 08 '20 at 21:44
That would seem to eat the `/baz` that I want to match. – Joseph Sible-Reinstate Monica Jan 08 '20 at 21:46
@JosephSible-ReinstateMonica Ah! Okay, I will ponder. – Daniel Wagner Jan 08 '20 at 21:46
@JosephSible-ReinstateMonica Perhaps something like: `eat = string "/baz=" <|> fmap concat (sequence [many (noneOf "=/"), string "/", eat])`? (Possibly even `many (noneOf "/")` to make it accept slightly more stuff, depending on needs and wants...?) – Daniel Wagner Jan 08 '20 at 21:50
Okay, following this advice makes it a little bit cleaner. It's still nowhere near as clean as the regex-applicative or ReadP parsers are though. What I'm really looking for is how to cleanly turn regexes into Megaparsec parsers in the general case, like I can for those other kinds of parsers, rather than complicated bespoke things for this particular regex. – Joseph Sible-Reinstate Monica Jan 08 '20 at 23:23
@JosephSible-ReinstateMonica I don't think there's a way to do that in general that's significantly simpler than compiling to a DFA and converting that to a big mutually-recursive parser, one definition per state. – Daniel Wagner Jan 09 '20 at 00:40
Your latest edit seems incorrect: `bar/baz=` matches the regex `[^=]+/baz=`, as well as the corresponding part of my messy parser, but not of your parser. – Joseph Sible-Reinstate Monica Jan 09 '20 at 03:30
@JosephSible-ReinstateMonica Sure, sure. Should be easy enough to fix up by including the preceding `/` separator, though. See update. – Daniel Wagner Jan 09 '20 at 05:05
Hmm, the structure of `slashPath` looks an awful lot like that of `some` now. I wonder if there's a generalization of it. – Joseph Sible-Reinstate Monica Jan 09 '20 at 05:14
