
I am stuck at the following parsing problem:

Parse some text string that may contain zero or more elements from a limited character set, up to but not including one of a set of termination characters. Content/no content should be indicated through Maybe. Termination characters may appear in the string in escaped form. Parsing should fail on any inadmissible character.

This is what I came up with (simplified):

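import Control.Applicative ((<|>))
import Data.Text (Text)
import qualified Data.Text as T
import Data.Void (Void)
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC

type Parser = MP.Parsec Void Text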

-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
...

-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
...

-- The escape character.
escChar :: Char
...


pComponent :: Parser (Maybe Text)
pComponent = do
  t <- MP.many (escaped <|> regular)
  if null t then return Nothing else return $ Just (T.pack t)
 where
  regular = MP.satisfy isAdmissibleChar <|> fail "Inadmissible character"
  escaped = do
    _ <- MC.char escChar
    MP.satisfy isControlChar -- only control characters may be escaped

Say the admissible characters are uppercase ASCII letters, the escape character is '\', and the only control character is ':'. Then ABC\:D:EF parses correctly and yields ABC:D. However, parsing ABC&D, where & is inadmissible, yields ABC, whereas I would expect an error message instead.

Two questions:

  • Why does fail end parsing instead of failing the parser?
  • Is the above a sensible approach to the problem, or is there a "proper", canonical way to parse such terminated strings that I am not aware of?
Ulrich Schuster

3 Answers


many has to allow its sub-parser to fail once without the whole parse failing - for example many (char 'A') *> char 'B', while parsing "AAAB", has to fail to parse the B to know it got to the end of the As.
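A minimal sketch of that behaviour, assuming a Parser type over Text with no custom errors (the name asThenB is made up for illustration, and it uses <* rather than *> so that the collected As are returned):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Void (Void)
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC

-- Assumed parser type: Text input, no custom error component.
type Parser = MP.Parsec Void Text

-- many stops at the first non-'A' character: char 'A' fails there
-- without consuming input, and many treats that as "end of the list".
asThenB :: Parser String
asThenB = MP.many (MC.char 'A') <* MC.char 'B'

main :: IO ()
main = do
  print (MP.parseMaybe asThenB "AAAB")  -- Just "AAA"
  print (MP.parseMaybe asThenB "AAAC")  -- Nothing
```

The second case only fails because char 'B' comes after many. With many alone, the 'C' would simply be left unconsumed, which is the effect seen in the question.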

You might want manyTill which allows you to recognise the terminator explicitly. Something like this:

MP.manyTill (escaped <|> regular) (MP.satisfy isControlChar)

"ABC&D" would give an error here assuming '&' isn't accepted by isControlChar.
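As a rough, self-contained sketch of the manyTill variant, assuming the example character sets from the question (uppercase ASCII admissible, ':' as the control character, '\' as escape); the names pTerminated and demo are hypothetical:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Char (isAsciiUpper)
import Data.Text (Text)
import qualified Data.Text as T
import Data.Void (Void)
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC

type Parser = MP.Parsec Void Text

-- Collect characters until an unescaped ':' is seen.
-- Note that manyTill consumes the terminator and requires it to be present.
pTerminated :: Parser (Maybe Text)
pTerminated = do
  cs <- MP.manyTill (escaped <|> regular) (MP.satisfy (== ':'))
  pure (if null cs then Nothing else Just (T.pack cs))
 where
  regular = MP.satisfy isAsciiUpper <|> fail "Inadmissible character"
  escaped = MC.char '\\' *> MP.satisfy (== ':')

-- Hypothetical helper: run the parser and summarise the outcome.
demo :: Text -> String
demo input = case MP.parse pTerminated "" input of
  Left _  -> "parse error"
  Right r -> show r

main :: IO ()
main = mapM_ (putStrLn . demo)
  [ "ABC\\:D:EF"  -- Just "ABC:D"
  , "ABC&D"       -- parse error
  , "ABC"         -- parse error (no terminator before end of input)
  ]
```

Unlike the original pComponent, this version consumes the terminating ':' and rejects input that ends without one, so it fits best when every component is known to be followed by a terminator.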

Or if you want to parse more than one component you might keep your existing definition of pComponent and use it with sepBy or similar, like:

MP.sepBy pComponent (MP.satisfy isControlChar)

If you also check for end-of-file after this, like:

MP.sepBy pComponent (MP.satisfy isControlChar) <* MP.eof

then "ABC&D" should give an error again, because the '&' will end the first component but will not be accepted as a separator.
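A sketch of the sepBy variant under the same assumptions as the question's example (uppercase ASCII admissible, ':' as separator, '\' as escape); pComponents and demo are names introduced here for illustration:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Char (isAsciiUpper)
import Data.Text (Text)
import qualified Data.Text as T
import Data.Void (Void)
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC

type Parser = MP.Parsec Void Text

-- The question's component parser, unchanged in behaviour.
pComponent :: Parser (Maybe Text)
pComponent = do
  cs <- MP.many (escaped <|> regular)
  pure (if null cs then Nothing else Just (T.pack cs))
 where
  regular = MP.satisfy isAsciiUpper <|> fail "Inadmissible character"
  escaped = MC.char '\\' *> MP.satisfy (== ':')

-- Components separated by ':'; eof forces the whole input to be consumed.
pComponents :: Parser [Maybe Text]
pComponents = MP.sepBy pComponent (MP.satisfy (== ':')) <* MP.eof

-- Hypothetical helper: run the parser and summarise the outcome.
demo :: Text -> String
demo input = case MP.parse pComponents "" input of
  Left _  -> "parse error"
  Right r -> show r

main :: IO ()
main = mapM_ (putStrLn . demo)
  [ "ABC\\:D:EF"  -- [Just "ABC:D",Just "EF"]
  , "A::B"        -- [Just "A",Nothing,Just "B"]
  , "ABC&D"       -- parse error
  ]
```

The error for "ABC&D" comes from eof rather than from pComponent: the component ends quietly at '&', the separator does not match, and eof then rejects the leftover input.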

David Fletcher

What a parser object normally does is to extract from the input stream whatever subset it is supposed to accept. That's the usual rule.

Here, it seems you want the parser to accept only strings that are followed by something specific. From your examples, that is either end of input (eof) or the character ':'. So you might want to consider lookahead.

Environment and auxiliary functions:


import            Data.Void  (Void)
import qualified  Data.Text        as  T
import qualified  Text.Megaparsec  as  MP
import qualified  Text.Megaparsec.Char  as  MC

type Parser = MP.Parsec Void T.Text

-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
isAdmissibleChar ch  =  elem ch ['A' .. 'Z']

-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
isControlChar ch = elem ch ":"

-- The escape character:
escChar :: Char
escChar = '\\'

Termination parser, to be used for look ahead:

termination :: Parser ()
termination = MP.eof  MP.<|>  (() <$ MP.satisfy isControlChar)

Modified pComponent parser:

pComponent :: Parser (Maybe T.Text)
pComponent = do
    txt <- MP.many (escaped  MP.<|>  regular)
    MP.lookAhead  termination  --  **CHANGE HERE** 
    if (null txt)  then  (return Nothing)  else  (return $ Just (T.pack txt))
 where
   regular = (MP.satisfy isAdmissibleChar)  MP.<|>  (fail "Inadmissible character")
   escaped = do
     _ <- MC.char escChar
     MP.satisfy isControlChar -- only control characters may be escaped

Testing utility:

tryParse :: String -> IO ()
tryParse str = do
    let  res = MP.parse  pComponent  "(noname)"  (T.pack str)
    print res

Let's try to rerun your examples:

$ ghci
 λ> 
 λ> :load q67809465.hs
 λ>
 λ> str1 = "ABC\\:D:EF"
 λ> putStrLn str1
 ABC\:D:EF
 λ> 
 λ> tryParse str1
 Right (Just "ABC:D")
 λ> 

So that is successful, as desired.

 λ> 
 λ> tryParse "ABC&D"
Left (ParseErrorBundle {bundleErrors = TrivialError 3 (Just (Tokens ('&' :| ""))) (fromList [EndOfInput]) :| [], bundlePosState = PosState {pstateInput = "ABC&D", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "(noname)", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}})
 λ> 

So that fails, as desired.

Trying our 2 acceptable termination contexts:

 λ> tryParse "ABC:&D"
 Right (Just "ABC")
 λ> 
 λ> 
 λ> tryParse "ABCDEF"
 Right (Just "ABCDEF")
 λ> 

jpmarinier
  • Thanks very much for the detailed explanation! Even though the answer by David Fletcher was simpler, yours gave me some important insight into parsing. – Ulrich Schuster Jun 03 '21 at 10:04

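fail does not abort the whole parse. It only makes the current parser fail, and since no input has been consumed at that point, the enclosing combinator moves on to its next alternative. In this case that alternative is the empty-list case introduced by the many combinator, so parsing stops successfully without an error message.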

I think the best way to solve your problem is to require that the component is followed by a termination character (or end of input), so that the parser cannot "succeed" halfway like this. You can do that with the notFollowedBy or lookAhead combinators. Here is the relevant part of the megaparsec tutorial.
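One possible sketch using notFollowedBy, assuming the example character sets from the question (uppercase ASCII admissible, ':' as the control character, '\' as escape). The check is phrased as "not followed by anything other than ':'", which also accepts end of input; demo is a hypothetical helper:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Char (isAsciiUpper)
import Data.Text (Text)
import qualified Data.Text as T
import Data.Void (Void)
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC

type Parser = MP.Parsec Void Text

pComponent :: Parser (Maybe Text)
pComponent = do
  cs <- MP.many (escaped <|> regular)
  -- Succeeds only at end of input or in front of ':'; consumes nothing.
  MP.notFollowedBy (MP.satisfy (/= ':'))
  pure (if null cs then Nothing else Just (T.pack cs))
 where
  regular = MP.satisfy isAsciiUpper
  escaped = MC.char '\\' *> MP.satisfy (== ':')

-- Hypothetical helper: run the parser and summarise the outcome.
demo :: Text -> String
demo input = case MP.parse pComponent "" input of
  Left _  -> "parse error"
  Right r -> show r

main :: IO ()
main = mapM_ (putStrLn . demo)
  [ "ABC:EF"  -- Just "ABC"
  , "ABC"     -- Just "ABC"
  , "ABC&D"   -- parse error
  ]
```

Because notFollowedBy consumes no input, the terminating ':' is left for whatever parser runs next, which matches the "up to but not including" requirement in the question.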

Noughtmare