
I'm looking at this example from attoparsec docs:

simpleComment   = string "<!--" *> manyTill anyChar (string "-->")

This will build a [Char] instead of a ByteString slice. That's not good with huge comments, right?

The other alternative, takeWhile:

takeWhile :: (Word8 -> Bool) -> Parser ByteString

cannot accept a parser (i.e. cannot match a ByteString, only a Word8).

Is there a way to parse a chunk of a ByteString with attoparsec that doesn't involve building a [Char] in the process?

levant pied

1 Answer


You can use scan:

scan :: s -> (s -> Word8 -> Maybe s) -> Parser ByteString

A stateful scanner. The predicate consumes and transforms a state argument, and each transformed state is passed to successive invocations of the predicate on each byte of the input until one returns Nothing or the input ends.

It would look something like this:

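import Control.Applicative ((<|>))
import Data.Word (Word8)

-- ((state, next char), new state); the state counts how much of "-->" has matched
transitions :: [((Int, Char), Int)]
transitions = [((0, '-'), 1), ((1, '-'), 2), ((2, '-'), 2), ((2, '>'), 3)]

-- Nothing stops the scan; any transition not listed resets to state 0
dfa :: Int -> Word8 -> Maybe Int
dfa 3 _ = Nothing
dfa s w = lookup (s, toEnum (fromEnum w)) transitions <|> Just 0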

Then use scan 0 dfa to take bytes up to and including the closing "-->". The state records how many characters of "-->" have been seen so far; once all three have been seen, dfa returns Nothing to tell scan to stop. Note that scan also stops successfully if the input ends first, so check that the result actually ends with "-->". This is just to illustrate the idea: for efficiency you might replace the association list with a faster data structure, move the *Enum calls into the lookup table, or even write the function directly.
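To make the idea concrete, here is a minimal sketch of a complete comment parser built on scan, with the transition function written directly instead of going through a lookup table. The names step and commentBody are made up for this illustration, and it assumes a bytestring version that provides stripSuffix (0.10.8 or later). Because scan also succeeds when the input runs out, the sketch checks for the terminator afterwards and strips it from the result.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Attoparsec.ByteString (Parser, parseOnly, scan, string)
import           Data.ByteString            (ByteString)
import qualified Data.ByteString            as B
import           Data.Word                  (Word8)

-- State: how many bytes of the terminator "-->" have matched so far (0..3).
step :: Int -> Word8 -> Maybe Int
step 3 _  = Nothing                 -- terminator complete: stop the scan
step s 45 = Just (min 2 (s + 1))    -- '-' advances, but never past "--"
step 2 62 = Just 3                  -- '>' right after "--" completes "-->"
step _ _  = Just 0                  -- anything else resets the match

-- Hypothetical name; returns the comment body as a ByteString, no [Char].
commentBody :: Parser ByteString
commentBody = do
  _   <- string "<!--"
  raw <- scan 0 step
  -- scan also succeeds at end of input, so confirm the terminator is there.
  case B.stripSuffix "-->" raw of
    Just body -> pure body
    Nothing   -> fail "unterminated comment"

main :: IO ()
main = do
  print (parseOnly commentBody "<!-- hi -->rest")  -- Right " hi "
  print (parseOnly commentBody "<!--a--->")        -- Right "a-"
```

The second call shows why state 2 loops on '-': in "a--->" the terminator is the last three bytes, so the body keeps the first dash.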

Daniel Wagner
  • OK that would work in this case, but no way to generalize to a custom parser as `manyTill` does? Just as an example, say that instead of `-->` the end HTML comment can also have digits interspersed - so e.g. `-1-9>`, `-55->` and `--37>` are all valid endings. Sure you can write it manually with `scan`, but kind of defeats the purpose of a parser, right? – levant pied Sep 05 '20 at 00:09
  • @levantpied I believe that API is not currently conveniently exposed from attoparsec. A patch to add it to the API would almost certainly be accepted, and I believe there is no technical reason it could not be created. In the meantime, enough internals of the `Parser` type look like they are exposed for you to play around with it aftermarket, so to speak, and only propose a patch once you have something working. – Daniel Wagner Sep 05 '20 at 00:14
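One possible route for the general case raised in the comments, offered as a sketch rather than a confirmed recommendation: if your attoparsec version exports `match` from `Data.Attoparsec.ByteString` (it returns the input consumed by a parser alongside that parser's result), an arbitrary end parser can be used while still getting the body back as a ByteString slice. The helper names `sliceTill` and `fancyEnd` are hypothetical. The end parser is still tried at every byte position, so this may well be slower than `scan`, but no intermediate [Char] is built.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Control.Applicative        ((<|>))
import           Data.Attoparsec.ByteString (Parser, anyWord8, match, parseOnly,
                                             skipWhile, string, word8)
import           Data.ByteString            (ByteString)
import qualified Data.ByteString            as B

-- Hypothetical helper: consume input until `end` succeeds and return the
-- bytes before the terminator as a slice of the input.
sliceTill :: Parser a -> Parser ByteString
sliceTill end = do
  (whole, term) <- match go
  pure (B.take (B.length whole - B.length term) whole)
  where
    -- Try the terminator here; on failure, skip one byte and retry.
    -- The result is the bytes that the terminator itself consumed.
    go = (fst <$> match end) <|> (anyWord8 *> go)

-- The terminator from the comment above: '-', digits, '-', digits, '>'.
fancyEnd :: Parser ()
fancyEnd = dash *> digits *> dash *> digits *> (() <$ word8 62)
  where
    dash   = word8 45
    digits = skipWhile (\w -> w >= 48 && w <= 57)

main :: IO ()
main = print (parseOnly (string "<!--" *> sliceTill fancyEnd) "<!-- hi -55->rest")
-- Right " hi "
```

This relies on attoparsec backtracking whenever the end parser fails partway through, which is what lets `go` retry from the next byte.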