Why does this parser always fail when the end-of-line sequence is CRLF?

Question

This simple parser is expected to parse messages of the form

key: value\r\nkey: value\r\n\r\nkey: value\r\nkey: value\r\n\r\n

One EOL acts as a field separator, and double EOL acts as a message separator. It works perfectly fine when the EOL separator is \n but parseWith always returns fail when it is \r\n.

parsePair = do
  key <- B8.takeTill (==':')
  _ <- B8.char ':'
  _ <- B8.char ' '
  value <- B8.manyTill B8.anyChar endOfLine
  return (key, value)

parseListPairs = sepBy parsePair endOfLine <* endOfLine

parseMsg = sepBy parseListPairs endOfLine <* endOfLine

Where did you get `endOfLine` from, or how did you define it? What does `B8` stand for? I can make an educated guess, but still... You need to make the code in your question self-contained, by adding your import statements and custom definitions. The problem is probably that `endOfLine` is defined to parse `\n`, not `\r\n`. You will have to define your own `endOfLine` parser for `\r\n`. — jub0bs, Sep 03 '15 at 06:46
@Jubobs Probably not. Chances are this is [attoparsec's `endOfLine`](https://hackage.haskell.org/package/attoparsec-0.13.0.1/docs/Data-Attoparsec-Text.html#v:endOfLine). — duplode, Sep 03 '15 at 06:55
@duplode You're right, but the relevant link is probably [this one](http://hackage.haskell.org/package/attoparsec-0.13.0.1/docs/Data-Attoparsec-ByteString-Char8.html#g:13), as suggested by the name `B8`. — jub0bs, Sep 03 '15 at 07:00
endOfLine is from [Data.Attoparsec.ByteString.Char8](http://hackage.haskell.org/package/attoparsec-0.13.0.1/docs/src/Data-Attoparsec-ByteString-Internal.html#endOfLine]%28Data.Attoparsec.ByteString.Char8%29) I checked the implementation and it handles both end lines. — concept3d, Sep 03 '15 at 07:33
You need to clarify. You state that one EOL acts as a pair separator, and two EOLs act as a message separator, but your example (`key: value\r\nkey: value\r\n\r\n`) suggests that two EOLs act as a message *terminator*, instead. Separator would make more sense to me, but which is it? Separator, or terminator? — jub0bs, Sep 04 '15 at 10:18

ErikR · Accepted Answer · 2015-09-03T07:09:32.500

4

I'm assuming you are using these imports:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.ByteString.Char8 as B8
import Data.Attoparsec.ByteString.Char8

The problem is that endOfLine consumes the end of line, so perhaps you really want something like:

parseListPairs = B8.many1 parsePair <* endOfInput

For instance, this works:

ghci> parseOnly parseListPairs "k: v\r\nk2: v2\r\n"
Right [("k","v"),("k2","v2")]

Update:

For parsing multiple messages you can use:

parseListPairs = B8.manyTill parsePair endOfLine
parseMsgs = B8.manyTill parseListPairs endOfInput

ghci> test3 = parseOnly parseMsgs "k1: v1\r\nk2: v2\r\n\r\nk3: v3\r\nk4: v4\r\n\r\n"
Right [[("k1","v1"),("k2","v2")],[("k3","v3"),("k4","v4")]]

edited Sep 03 '15 at 07:09

answered Sep 03 '15 at 06:53

ErikR

51,541
9
73
124

1

This is definitely better, but it still fails at some point, I need to revise the structure and do some tests. – concept3d Sep 03 '15 at 07:29
`parseOnly parseMsgs "k: v"` fails, but should yield `Right [[("k","v")]]`. – jub0bs Sep 03 '15 at 14:35

jub0bs · Answer 2 · 2015-09-03T18:38:46.840

Problems

Your code isn't self-contained and the actual problem is unclear. However, I suspect your woes are actually caused by how keys are parsed; in particular, something like \r\nk is a valid key, according to your parser:

λ> parseOnly parsePair "\r\nk: v\r\n"
Right ("\r\nk","v")

That needs to be fixed.

Moreover, since one EOL separates (rather than terminates) key-value pairs, an EOL shouldn't be consumed at the end of your parsePair parser.

Another tangential issue: because you use the many1 combinator instead ByteString-oriented parsers (such as takeTill), your values have type String instead of ByteString. That's probably not what you want, here, because it defeats the purpose of using ByteString in the first place.; see Performance considerations.

Solution

I suggest the following refactoring:

{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString ( ByteString )

import Data.Attoparsec.ByteString.Char8 ( Parser
                                        , count
                                        , endOfLine
                                        , parseOnly
                                        , sepBy
                                        , string
                                        , takeTill
                                        )

-- convenient type synonyms
type KVPair = (ByteString, ByteString)
type Msg = [KVPair]

pair :: Parser KVPair
pair = do
    k <- key
    _ <- string ": "
    v <- value
    return (k, v)
  where
    key     = takeTill (\c -> c == ':' || isEOL c)
    value   = takeTill isEOL
    isEOL c = c == '\n' || c == '\r'

-- one EOL separates key-value pairs
msg :: Parser Msg
msg = sepBy pair endOfLine

-- two EOLs separate messages
msgs :: Parser [Msg]
msgs = sepBy msg (count 2 endOfLine)

I have renamed your parsers, for consistency with attoparsec's, none of which have "parse" as a prefix:

parsePair --> pair
parseListPairs --> msg
parseMsg --> msgs

Tests in GHCi

λ> parseOnly keyValuePair  "\r\nk: v"
Left "string"

Good; you do want a fail, in this case.

λ> parseOnly keyValuePair  "k: v"
Right ("k","v")

λ> parseOnly msg "k: v\r\nk2: v2\r\n"
Right [("k","v"),("k2","v2")]

λ> parseOnly msgs  "k1: v1\r\nk2: v2\r\n\r\nk3: v3\r\nk4: v4"
Right [[("k1","v1"),("k2","v2")],[("k3","v3"),("k4","v4")]]

λ> parseOnly msgs "k: v"
Right [[("k","v")]]

I understand the performance consideration, my initial implementation was using Bytestring `parsePair :: Parser (ByteString,ByteString)` and takeTill/takeWhile combinators, actually it was very similar to the one you provided, it also failed but I couldn't find the problem, and I posted a different version, now after I saw your impl I suspect my problem was in the definition of isEndOfLine — concept3d, Sep 03 '15 at 08:59
I ended up with infinite loop, same result as my original implementation, I will try to figure out the problem. There is also a concern that I am reading from socket. this implementation worked fine with parseOnly the first time. — concept3d, Sep 03 '15 at 09:24
@concept3d Just curious: have you made any progress on this? — jub0bs, Sep 06 '15 at 11:24
this solution still goes into infinite loop when used with parseWith, similar to my original solution which I didn't post. I used the accepted answer temporarily, I'll get back to this soon but I have to finish more important things first. — concept3d, Sep 08 '15 at 07:49
@concept3d Ok. I don't think I can investigate any further until I see your code that uses `parseWith`. Good luck. — jub0bs, Sep 08 '15 at 08:16

Why does this parser always fail when the end-of-line sequence is CRLF?

2 Answers2

Problems

Solution

Tests in GHCi