4

I am currently trying to use the Full CSV Parser presented in Real World Haskell. In order to I tried to modify the code to use ByteString instead of String, but there is a string combinator which just works with String.

Is there a Parsec combinator similar to string that works with ByteString, without having to do conversions back and forth?

I've seen there is an alternative parser that handles ByteString: attoparsec, but I would prefer to stick with Parsec, since I'm just learning how to use it.

Guy Coder
  • 24,501
  • 8
  • 71
  • 136
  • 3
    `attoparsec` is a lot faster and has an API very similar to Parsec, so I'd really recommend you to look into it! – Niklas B. Mar 16 '13 at 22:14
  • 4
    I highly recommend you use the `cassava` library to parse CSVs. It's very fast (it uses `attoparsec` internally) and it's very easy to use. – Gabriella Gonzalez Mar 17 '13 at 01:25
  • I assume you mean on the return side? You'll have to do your own `pack`ing afterwards, ie. `pack $ string "foobar"` (remembering to think about encoding.) Other than that, Parsecs `string` works just fine on `ByteString`s on the input side. – Sarah Mar 17 '13 at 21:57
  • I'll certainly run some experiments using `attoparsec` and `Cassava`. However, my goal was to check how to use Parsec. CSV parsing is just a means to that end (for now). Thanks for sharing all this information! – Ariel D. Moya Sequeira Mar 18 '13 at 01:58
  • About the `pack`ing, I wanted to avoid it. I saw that Parsec handles `ByteString` input well, so yes: I meant the "return side". – Ariel D. Moya Sequeira Mar 18 '13 at 02:01

2 Answers2

5

I'm assuming you're starting with something like

import Prelude hiding (getContents, putStrLn)
import Data.ByteString
import Text.Parsec.ByteString

Here's what I've got so far. There are two versions. Both compile. Probably neither is exactly what you want, but they should aid discussion and help you to clarify your question.

Something I noticed along the way:

  • If you import Text.Parsec.ByteString then this uses uncons from Data.ByteString.Char8, which in turn uses w2c from Data.ByteString.Internal, to convert all read bytes to Chars. This enables Parsec's line and column number error reporting to work sensibly, and also enables you to use string and friends without problem.

Thus, the easy version of the CSV parser, which does exactly that:

import Prelude hiding (getContents, putStrLn)
import Data.ByteString (ByteString)

import qualified Prelude (getContents, putStrLn)
import qualified Data.ByteString as ByteString (getContents)

import Text.Parsec
import Text.Parsec.ByteString

csvFile :: Parser [[String]]
csvFile = endBy line eol
line :: Parser [String]
line = sepBy cell (char ',')
cell :: Parser String
cell = quotedCell <|> many (noneOf ",\n\r")

quotedCell :: Parser String
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return content

quotedChar :: Parser Char
quotedChar =
        noneOf "\""
    <|> try (string "\"\"" >> return '"')

eol :: Parser String
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

parseCSV :: ByteString -> Either ParseError [[String]]
parseCSV = parse csvFile "(unknown)"

main :: IO ()
main =
    do c <- ByteString.getContents
       case parse csvFile "(stdin)" c of
            Left e -> do Prelude.putStrLn "Error parsing input:"
                         print e
            Right r -> mapM_ print r

But this was so trivial to get working that I assume it cannot possibly be what you want. Perhaps you want everything to remain a ByteString or [Word8] or something similar all the way through? Hence my second attempt below. I am still importing Text.Parsec.ByteString, which may be a mistake, and the code is hopelessly riddled with conversions.

But, it compiles and has complete type annotations, and therefore should make a sound starting point.

import Prelude hiding (getContents, putStrLn)
import Data.ByteString (ByteString)
import Control.Monad (liftM)

import qualified Prelude (getContents, putStrLn)
import qualified Data.ByteString as ByteString (pack, getContents)
import qualified Data.ByteString.Char8 as Char8 (pack)

import Data.Word (Word8)
import Data.ByteString.Internal (c2w)

import Text.Parsec ((<|>), (<?>), parse, try, endBy, sepBy, many)
import Text.Parsec.ByteString
import Text.Parsec.Prim (tokens, tokenPrim)
import Text.Parsec.Pos (updatePosChar, updatePosString)
import Text.Parsec.Error (ParseError)

csvFile :: Parser [[ByteString]]
csvFile = endBy line eol
line :: Parser [ByteString]
line = sepBy cell (char ',')
cell :: Parser ByteString
cell = quotedCell <|> liftM ByteString.pack (many (noneOf ",\n\r"))

quotedCell :: Parser ByteString
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return (ByteString.pack content)

quotedChar :: Parser Word8
quotedChar =
        noneOf "\""
    <|> try (string "\"\"" >> return (c2w '"'))

eol :: Parser ByteString
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

parseCSV :: ByteString -> Either ParseError [[ByteString]]
parseCSV = parse csvFile "(unknown)"

main :: IO ()
main =
    do c <- ByteString.getContents
       case parse csvFile "(stdin)" c of
            Left e -> do Prelude.putStrLn "Error parsing input:"
                         print e
            Right r -> mapM_ print r

-- replacements for some of the functions in the Parsec library

noneOf :: String -> Parser Word8
noneOf cs   = satisfy (\b -> b `notElem` [c2w c | c <- cs])

char :: Char -> Parser Word8
char c      = byte (c2w c)

byte :: Word8 -> Parser Word8
byte c      = satisfy (==c)  <?> show [c]

satisfy :: (Word8 -> Bool) -> Parser Word8
satisfy f   = tokenPrim (\c -> show [c])
                        (\pos c _cs -> updatePosChar pos c)
                        (\c -> if f (c2w c) then Just (c2w c) else Nothing)

string :: String -> Parser ByteString
string s    = liftM Char8.pack (tokens show updatePosString s)

Probably your concern, efficiency-wise, should be those two ByteString.pack instructions, in the definitions of cell and quotedCell. You might try to replace the Text.Parsec.ByteString module so that instead of “making strict ByteStrings an instance of Stream with Char token type,” you make ByteStrings an instance of Stream with Word8 token type, but this won't help you with efficiency, it will just give you a headache trying to reimplement all the sourcePos functions to keep track of your position in the input for error messages.

No, the way to make it more efficient would be to change the types of char, quotedChar and string to Parser [Word8] and the types of line and csvFile to Parser [[Word8]] and Parser [[[Word8]]] respectively. You could even change the type of eol to Parser (). The necessary changes would look something like this:

cell :: Parser [Word8]
cell = quotedCell <|> many (noneOf ",\n\r")

quotedCell :: Parser [Word8]
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return content

string :: String -> Parser [Word8]
string s    = [c2w c | c <- (tokens show updatePosString s)]

You don't need to worry about all the calls to c2w as far as efficiency is concerned, because they cost nothing.

If this doesn't answer your question, please say what would.

Beetle
  • 1,959
  • 16
  • 32
  • 1
    Wow! Your answer is very thorough. Indeed, that'll do. – Ariel D. Moya Sequeira Apr 30 '13 at 17:07
  • Beetle, your answer imports so many different `ByteString` modules, I'm very confused as to which ones are being used. Could you clarify it more? – tolgap Jan 31 '15 at 13:29
  • @tolgap: Firstly, you'll need to be looking at [Chapter 16 of the book](http://book.realworldhaskell.org/read/using-parsec.html#csv). All `import`s are necessary, so I'm using all of them. My goal here was just to create something that would compile, so that if the author got frustrated, he could look at the differences between his code and mine. The explanations are in the [documentation for the various modules](https://www.haskell.org/hoogle/). I can't remember any specific details from when I wrote this answer two years ago. – Beetle Feb 01 '15 at 16:31
  • @tolgap: There shouldn't be any ambiguity. For example, `getContents` is imported from `Data.ByteString` and hidden from `Prelude`. The two `pack` functions are imported as `ByteString.pack` and `Char8.pack`. In fact, the only time I haven't explicitly named what functions are imported is with `Text.Parsec.ByteString`, and from [its documentation](http://hackage.haskell.org/package/parsec-3.1.8/docs/Text-Parsec-ByteString.html) I can see that that's because it doesn't export anything except `parseFromFile` (which I'm not using) and the `Parser` type (which I am). What's confusing you? – Beetle Feb 01 '15 at 16:55
  • Thank you beetle, the book and documentation links are very helpful. – tolgap Feb 01 '15 at 17:06
0

I don't believe so. You will need to create one yourself using tokens. Although the documentation for it is a bit... nonexistent, the first two arguments are a function to use to show the expected tokens in an error message and a function to update the source position that will be printed in errors.

Daniel Wagner
  • 145,880
  • 9
  • 220
  • 380