The requirements come from the DOT language specification. More precisely, I'm trying to parse an ID, which can be, for example:

any double-quoted string ("...") possibly containing escaped quotes (\");

The following should be a minimal example.

{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Text.Megaparsec
import           Text.Megaparsec.Char
import           Data.Void
import           Data.Char
import           Data.Text               hiding ( map
                                        , all
                                        , concat
                                        )

type Parser = Parsec Void Text

escape :: Parser String
escape = do
    d <- char '\\'
    c <- oneOf ['\\', '\"', '0', 'n', 'r', 'v', 't', 'b', 'f']
    return [d, c]

nonEscape :: Parser Char
nonEscape = noneOf ['\\', '\"', '\0', '\n', '\r', '\v', '\t', '\b', '\f']

identPQuoted :: Parser String
identPQuoted =
    let inner = fmap return (try nonEscape) <|> escape
    in  do
      char '"'
      strings <- many inner
      char '"'
      return $ concat strings

identP :: Parser Text
identP = identPQuoted >>= return . pack

main = parseTest identP "\"foo \"bar\""

The above code returns "foo ", even though I want foo "bar.

I don't understand why. I thought that megaparsec would repeatedly apply inner until it reaches the final ". Instead, it only applies the nonEscape parser repeatedly; the first time that fails and it falls back to escape, it appears to skip the rest of the inner string and move straight on to the closing quote.

Vey
  • `>>= return .` can be replaced by `<$>`: `identP = pack <$> identPQuoted` – melpomene Sep 13 '18 at 18:34
  • The `do` block in `identPQuoted` can be written as `char '"' *> (concat <$> many inner) <* char '"'` – melpomene Sep 13 '18 at 18:37
  • Can you post a [mcve]? I'd like to try this myself. – melpomene Sep 13 '18 at 18:37
  • The requirement is rather poorly worded. Can you show a real grammar of your input language? Also, try `reads`. – n. m. could be an AI Sep 13 '18 at 18:48
  • I added an example @melpomene and updated the requirements – Vey Sep 13 '18 at 19:26
  • A list of characters like `['\\', '\"', '0', 'n', 'r', 'v', 't', 'b', 'f']` can be written more conveniently and compactly as a `String` literal like `"\\\"0nrvtbf"`, since `type String = [Char]` – Jon Purdy Sep 14 '18 at 02:10
  • Warning: the list of characters in `nonEscape` is a list of *single characters*, i.e. `\n` is a literal newline character, `\0` is the null character (which you would never see in a plain text file). Probably not what you meant. – luqui Sep 14 '18 at 07:49
  • @luqui That part looks fine to me. It says double-quoted strings cannot contain literal control characters (such as NUL, newline, etc). – melpomene Sep 14 '18 at 12:35

1 Answer

Your input text is "foo "bar", which does not contain any escaped quotes. It is parsed as a complete ID of "foo " (followed by bar", which is ignored).

If you want to make sure that your parser consumes all of the available input, you can use

parseTest (identP <* eof) "..."
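As a rough sketch of the difference, the program below runs the question's parser on the question's input twice, once without and once with eof. It assumes a reasonably recent megaparsec; the parser is the same grammar as in the question, only written more compactly, so treat it as an illustration rather than a reference implementation.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Data.Text            (Text, pack)
import           Data.Void            (Void)
import           Text.Megaparsec
import           Text.Megaparsec.Char (char)

type Parser = Parsec Void Text

-- A backslash followed by one escapable character, kept verbatim.
escape :: Parser String
escape = do
    d <- char '\\'
    c <- oneOf ['\\', '\"', '0', 'n', 'r', 'v', 't', 'b', 'f']
    return [d, c]

-- Any character that is neither a backslash, a quote, nor a control character.
nonEscape :: Parser Char
nonEscape = noneOf ['\\', '\"', '\0', '\n', '\r', '\v', '\t', '\b', '\f']

identP :: Parser Text
identP = pack . concat <$> (char '"' *> many inner <* char '"')
  where
    inner = fmap return nonEscape <|> escape

main :: IO ()
main = do
    -- Succeeds after the second quote; the trailing bar" is left unconsumed.
    parseTest identP "\"foo \"bar\""
    -- Reports a parse error, because input remains after the closing quote.
    parseTest (identP <* eof) "\"foo \"bar\""
```

With eof in place, leftover input is reported as an error instead of being dropped silently, which makes this kind of mistake much easier to spot.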

If you want to provide an ID with an escaped quote to the parser, like this ...

"foo \"bar"

... then you need to escape all of the special characters to embed them in Haskell source code:

main = parseTest identP "\"foo \\\"bar\""

\" represents a literal " and \\ represents a literal \.
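One way to check what the parser actually receives is to print both literals. This is a small sketch that needs only base; the names questionInput and escapedInput are made up for the illustration.

```haskell
module Main where

-- The literal from the question: every \" is a plain quote character,
-- so the resulting string contains no backslash at all.
questionInput :: String
questionInput = "\"foo \"bar\""

-- The corrected literal: \\ contributes the backslash and \" the quote.
escapedInput :: String
escapedInput = "\"foo \\\"bar\""

main :: IO ()
main = do
    putStrLn questionInput  -- prints: "foo "bar"
    putStrLn escapedInput   -- prints: "foo \"bar"
```

Only the second string contains the backslash that the escape parser is looking for.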

melpomene