Conduit and Attoparsec - extracting delimited text

Question

Say I have a document with text delimited by Jade-style brackets, like {{foo}}. I've written an Attoparsec parser that seems to extract foo properly:

findFoos :: Parser [T.Text]
findFoos = many $ do
  manyTill anyChar (string "{{")
  manyTill letter (string "}}")

Testing it shows that it works:

> parseOnly findFoos "{{foo}}"
Right ["foo"]
> parseOnly findFoos "{{foo}} "
Right ["foo"]

Now, with the Data.Conduit.Attoparsec module in conduit-extra, I seem to be running into strange behavior:

> yield "{{foo}}" $= (mapOutput snd $ CA.conduitParser findFoos) $$ CL.mapM_ print
["foo"]
> yield "{{foo}} " $= (mapOutput snd $ CA.conduitParser findFoos) $$ CL.mapM_ print
-- floods stdout with empty lists

Is this the desired behavior? Is there a conduit utility I should be using here? Any help with this would be tremendous!

danidiaz · Accepted Answer · 2015-05-23T10:24:30.690

Because it uses many, findFoos will return [] without consuming input when it doesn't find any delimited text.

On the other hand, conduitParser applies a parser repeatedly on a stream, returning each parsed value until it exhausts the stream.

The problem with "{{foo}} " is that the parser will consume {{foo}}, but the blank space remains unconsumed in the stream, so further invocations of the parser always return [].

If you redefine findFoos to consume one quoted element at a time, including the trailing blanks, it should work:

findFoos' :: Parser String
findFoos' = do
   manyTill anyChar (string "{{")
   manyTill letter (string "}}") <* skipSpace

Real-world examples will have other characters between bracketed texts, so skipping the "extra stuff" after each parse (without consuming any of the {{ opening braces for the next parse) will be a bit more involved.

Perhaps something like the following will work:

findFoos'' :: Parser String
findFoos'' = do
    manyTill anyChar (string "{{")
    manyTill letter (string "}}") <* skipMany everythingExceptOpeningBraces
  where 
    -- is there a simpler / more efficient way of doing this?
    everythingExceptOpeningBraces =
        -- skip one or more non-braces
        (skip (/='{') *> skipWhile (/='{'))
        <|> 
        -- skip single brace followed by non-brace character
        (skip (=='{') *> skip (/='{'))
        <|>
        -- skip a brace at the very end 
        (skip (=='{') *> endOfInput)

(This parser will fail, however, if there aren't any bracketed texts in the stream. Perhaps you could build a Parser (Maybe Text) that returns Nothing in that case.)

Conduit and Attoparsec - extracting delimited text

1 Answers1