CSV Parsing Issue with Attoparsec

Question

Here is my code that does CSV parsing, using the text and attoparsec libraries:

{-# LANGUAGE OverloadedStrings #-}

import           Control.Applicative (many, (<|>))
import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T

-- | Parse a field of a record.
field :: A.Parser T.Text -- ^ parser
field = fmap T.concat quoted <|> normal A.<?> "field"
  where
    normal  = A.takeWhile (A.notInClass "\n\r,\"")     A.<?> "normal field"
    quoted  = A.char '"' *> many between <* A.char '"' A.<?> "quoted field"
    between = A.takeWhile1 (/= '"') <|> (A.string "\"\"" *> pure "\"")


-- | Parse a block of text into a CSV table.
comma :: T.Text                   -- ^ CSV text
      -> Either String [[T.Text]] -- ^ error | table
comma text
  | T.null text = Right []
  | otherwise   = A.parseOnly table text
  where
    table  = A.sepBy1 record A.endOfLine A.<?> "table"
    record = A.sepBy1 field (A.char ',') A.<?> "record"

This works well for a variety of inputs, but it does not work when the input ends with a trailing \n.

Current behaviour:

> comma "hello\nworld"
Right [["hello"],["world"]]

> comma "hello\nworld\n"
Right [["hello"],["world"],[""]]

Wanted behaviour:

> comma "hello\nworld"
Right [["hello"],["world"]]

> comma "hello\nworld\n"
Right [["hello"],["world"]]

I have been trying to fix this issue, but I have run out of ideas. I am almost certain that the fix will involve A.endOfInput, as that is the significant anchor and the only "bonus" information we have. Any ideas on how to work that into the code?

One possible idea is to look at the end of the string before running the Attoparsec parser and remove the last character (or two in the case of \r\n), but that seems to be a hacky solution that I would like to avoid in my code.
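
For illustration, here is a minimal sketch of how the end of input might be worked into the table parser instead of trimming the text beforehand. It assumes that a line ending should only separate two records when more input follows it, and that a single final line ending directly before the end of input should be dropped. The names commaTrailing and rowSep are hypothetical and not part of the library; field is repeated unchanged so that the snippet is self-contained. Note that requiring A.endOfInput is a design choice: unlike the original, this variant rejects input with unparsable trailing content rather than ignoring it.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Control.Applicative (many, optional, (<|>))
import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T

-- | Parse a field of a record (same as the original parser).
field :: A.Parser T.Text
field = fmap T.concat quoted <|> normal A.<?> "field"
  where
    normal  = A.takeWhile (A.notInClass "\n\r,\"")     A.<?> "normal field"
    quoted  = A.char '"' *> many between <* A.char '"' A.<?> "quoted field"
    between = A.takeWhile1 (/= '"') <|> (A.string "\"\"" *> pure "\"")

-- | Hypothetical variant of 'comma' that tolerates one trailing line ending.
commaTrailing :: T.Text -> Either String [[T.Text]]
commaTrailing text
  | T.null text = Right []
  | otherwise   = A.parseOnly (table <* optional A.endOfLine <* A.endOfInput) text
  where
    table  = A.sepBy1 record rowSep A.<?> "table"
    record = A.sepBy1 field (A.char ',') A.<?> "record"
    -- A line ending counts as a record separator only if at least one more
    -- character follows it; 'A.peekChar'' fails at the end of input without
    -- consuming anything, so the final line ending is left for 'optional'.
    rowSep = A.endOfLine <* A.peekChar'
```

With this sketch, both inputs from the question give the same table:

> commaTrailing "hello\nworld"
Right [["hello"],["world"]]

> commaTrailing "hello\nworld\n"
Right [["hello"],["world"]]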

Full code of the library can be found here: https://github.com/lovasko/comma

The issue is that `field` accepts the empty string (due to `takeWhile`), so `sepBy1 field ","` accepts the empty string (and ",", ",,", etc.), and `sepBy1 record eol` accepts "", "\n", "\n\n", etc. If the empty string is actually a valid field (CSV is very general, so this is really a design choice), then the "wanted behaviour" would be wrong, as those actually are different tables! If it is not, you should fix `field` and you're done, since `parseOnly` already ignores all trailing input which is not recognized. – user2407038, Feb 03 '17 at 00:59
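
The second option in this comment could be sketched roughly as follows, assuming an unquoted empty field is treated as invalid: the only change to the field parser is `takeWhile1` in place of `takeWhile`. The names `fieldNonEmpty` and `commaNonEmpty` are hypothetical and used only for this illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Control.Applicative (many, (<|>))
import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T

-- | Like the original 'field', except that an unquoted field must contain
-- at least one character. A quoted field may still be empty.
fieldNonEmpty :: A.Parser T.Text
fieldNonEmpty = fmap T.concat quoted <|> normal A.<?> "field"
  where
    normal  = A.takeWhile1 (A.notInClass "\n\r,\"")    A.<?> "normal field"
    quoted  = A.char '"' *> many between <* A.char '"' A.<?> "quoted field"
    between = A.takeWhile1 (/= '"') <|> (A.string "\"\"" *> pure "\"")

-- | Hypothetical variant of 'comma' built on 'fieldNonEmpty'.
commaNonEmpty :: T.Text -> Either String [[T.Text]]
commaNonEmpty text
  | T.null text = Right []
  | otherwise   = A.parseOnly table text
  where
    -- After the last record, the trailing line ending is followed by no
    -- valid record, so 'sepBy1' backtracks and 'parseOnly' ignores the rest.
    table  = A.sepBy1 record A.endOfLine A.<?> "table"
    record = A.sepBy1 fieldNonEmpty (A.char ',') A.<?> "record"
```

Here `commaNonEmpty "hello\nworld\n"` gives `Right [["hello"],["world"]]`. The trade-off is that unquoted empty fields are no longer parsed: `commaNonEmpty "a,,b"` gives `Right [["a"]]`, because parsing stops at the first empty field and `parseOnly` silently ignores the remainder. An empty field then has to be written in quoted form, as in `a,"",b`.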
