Out of Memory Using Attoparsec

Question

I'm trying to make a simple parser with attoparsec. The production rules are along the lines of:

block:  ?token> [inline]
inline: <?token>foo<?> | anyText

So, what I'm trying to get at is, a block starts with the literal ?, followed by a token, followed by a >, followed by a sequence of inlines.

And an inline is either a sequence of the form <?token>foo<?>, or just any plain text.

I am having explosive memory use, but I'm not sure how I can factor the parser to avoid it. The point of the parser I'm writing is to pull out those 'token' things. Here is my implementation:

{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative
import Control.Monad
import Data.Attoparsec.Text
import Data.Text (Text)
import qualified Data.Text as Text

blockLine :: Parser [Text]
blockLine = do
  block   <- hiddenBlock                       -- the block token
  inlines <- many (hiddenInline <|> inline)    -- followed by inlines, which might have tokens
  return $ block : inlines

inline = Text.pack <$> manyTill anyChar (hiddenInline <|> (endOfInput >> return Text.empty))

hiddenInline = Text.pack <$> do
  char '<'   -- opening "tag"
  char '?'   -- opening "tag" still
  token <- manyTill anyChar (char '>')  -- the token
  manyTill anyChar (string "<?>") -- close the "tag"
  return token

hiddenBlock = Text.pack <$> do
  char '?'
  manyTill anyChar (char '>')

This looks, to me, to be a very straightforward translation of the production rules into an LL parser. I suppose the difficulty is that I'm not sure how to express the production for an inline. It's supposed to be "arbitrary" text, but the parse should stop as soon as it finds a hiddenInline.

score 2 · Accepted Answer · answered May 17 '14 at 05:13

The problem is that you nest a call to manyTill inside a use of many. Since one of inline's termination conditions is endOfInput, manyTill anyChar will happily consume all of your input and then succeed. Every subsequent use of inline will also succeed, this time without consuming anything, because manyTill may run its first parser zero times. So applying many to your inline parser makes many loop forever, successfully, while building an infinite list of empty strings. The behavior is more obvious in this example:

 parseOnly (many (manyTill anyChar endOfInput)) $ Text.pack ""
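
That expression never returns, so a terminating variant may make the same point more safely. The sketch below runs the inner parser exactly twice instead of wrapping it in many; the names rest and twice are made up for illustration. The second run succeeds without consuming anything, which is the step many would keep repeating.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, anyChar, endOfInput, manyTill, parseOnly)
import Data.Text (Text)

-- The inner parser from the example above: every character up to end of input.
rest :: Parser String
rest = manyTill anyChar endOfInput

-- Run it twice in a row (hypothetical helper, not part of the question).
-- The first run takes everything; the second still succeeds, on nothing.
twice :: Text -> Either String (String, String)
twice = parseOnly ((,) <$> rest <*> rest)
```

In GHCi, twice "abc" gives Right ("abc",""), and twice "" gives Right ("",""). A parser that can return a result like that second component is not safe to hand to many.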

The large amount of allocation is probably due to attoparsec building up continuations to manage backtracking. As a general rule, any parser you feed into many should not be able to trivially succeed (i.e. without consuming any of the input stream). So, you will need to either rewrite inline or otherwise restructure your parser to avoid this case.
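
One possible restructuring, as a sketch only: make every alternative under many consume at least one character, and check for endOfInput once, after the loop. The names plainText, strayAngle and inlineToken are invented here, and the sketch assumes that tokens never contain '>' and that a '<' which does not start a tag counts as plain text. Neither assumption comes from the question.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many, (<|>))
import Data.Attoparsec.Text
  (Parser, anyChar, char, endOfInput, manyTill, parseOnly, string, takeTill, takeWhile1)
import Data.Maybe (catMaybes)
import Data.Text (Text)

-- Block header: '?' token '>'
hiddenBlock :: Parser Text
hiddenBlock = char '?' *> takeTill (== '>') <* char '>'

-- Tagged inline: "<?" token '>' body "<?>". Only the token is kept.
hiddenInline :: Parser Text
hiddenInline = do
  _   <- string "<?"
  tok <- takeTill (== '>')
  _   <- char '>'
  _   <- manyTill anyChar (string "<?>")  -- skip the body and the closing tag
  return tok

-- Plain text: one or more characters, stopping before the next '<'.
-- takeWhile1 fails on zero characters, so this cannot succeed trivially.
plainText :: Parser ()
plainText = () <$ takeWhile1 (/= '<')

-- A '<' that does not begin a well-formed tag (assumed to be plain text).
strayAngle :: Parser ()
strayAngle = () <$ char '<'

-- Each alternative consumes input or fails, so many terminates.
inlineToken :: Parser (Maybe Text)
inlineToken =
      (Just <$> hiddenInline)
  <|> (Nothing <$ plainText)
  <|> (Nothing <$ strayAngle)

blockLine :: Parser [Text]
blockLine = do
  block   <- hiddenBlock
  inlines <- many inlineToken
  endOfInput                      -- checked once, outside the loop
  return (block : catMaybes inlines)
```

With these definitions, parseOnly blockLine "?head>some text <?tok1>foo<?> more <?tok2>bar<?> end" yields Right ["head","tok1","tok2"]. Using takeWhile1 and takeTill also avoids building intermediate Strings character by character, although the non-consuming loop, not the String building, is the cause of the runaway memory use described above.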
