Whitespace
First a comment on handling white space...
Generally the practice is to write your parsers so that they
consume the whitespace following a token
or syntactic unit. It's common to define combinator like:
lexeme p = p <* spaces
to easily convert a parser p to one that discards the whitespace
following whatever p parses. E.g., if you have
number = many1 digit
simply use lexeme number
whenever you want to eat up the
whitespace following the number.
For more on this approach to handling whitespace and other advice
on parsing languages, see this Megaparsec tutorial.
Label expressions
Based on your previous SO question it appears you want
to parse expressions of the form:
label1 | label2 | ... | labeln
where each label may be a simple label or a quoted label.
The idiomatic way to parse this pattern is to use sepBy
like this:
labels :: Parser String
labels = sepBy1 (try quotedLabel <|> simpleLabel) (char '|')
We define both simpleLabel and quotedLabel in terms of
what characters may occur in them. For simpleLabel a valid
character is a non-| and non-space:
simpleLabel :: Parser String
simpleLabel = many (noneOf "| ")
A quotedLabel is a single quote followed by a run
of valid quotedLabel-characters followed by an ending
single quote:
sq = char '\''
quotedLabel :: Parser String
quotedLabel = do
char sq
chs <- many validChar
char sq
return chs
A validChar is either a non-single quote or a single
quote not followed by eof or a vertical bar:
validChar = noneOf [sq] <|> try validQuote
validQuote = do
char sq
notFollowedBy eof
notFollowedBy (char '|')
return sq
The first notFollowedBy
will fail if the single quote appears just
before the end of input. The second notFollowedBy
will fail if
next character is a vertical bar. Therefore the sequence of the two
will succeed only if there is a non-vertical bar character following
the single quote. In this case the single quote should be interpreted
as part of the string and not the terminating single quote.
Unfortunately this doesn't quite work because the
current implementation of notFollowedBy
will always succeed with a parser which does not consume any
input -- i.e. like eof
. (See this issue for more details.)
To work around this problem we can use this alternate
implementation:
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
Here is the complete solution with some tests. By adding a few lexeme
calls you can make this parser eat up any white space where you decide
it is not significant.
import Text.Parsec hiding (labels)
import Text.Parsec.String
import Control.Monad
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
sq = '\''
validChar = do
noneOf "'" <|> try validQuote
validQuote = do
char sq
notFollowedBy' eof
notFollowedBy (char '|')
return sq
quotedLabel :: Parser String
quotedLabel = do
char sq
str <- many validChar
char sq
return str
plainLabel :: Parser String
plainLabel = many (noneOf "| ")
labels :: Parser [String]
labels = sepBy1 (try quotedLabel <|> try plainLabel) (char '|')
test input expected = do
case parse (labels <* eof) "" input of
Left e -> putStrLn $ "error: " ++ show e
Right v -> if v == expected
then putStrLn $ "OK - got: " ++ show v
else putStrLn $ "NOT OK - got: " ++ show v ++ " expected: " ++ show expected
test1 = test "a|b|c" ["a","b","c"]
test2 = test "a|'b b'|c" ["a", "b b", "c"]
test3 = test "'abc''|def" ["abc'", "def" ]
test4 = test "'abc'" ["abc"]
test5 = test "x|'abc'" ["x","abc"]