
What is the most efficient way to search a large text (300K+) for all matches of an existing Attoparsec parser?

I have written the following code, but it performs slowly:

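import Data.Attoparsec.Text (Parser, parseOnly)  -- assumed: the Text flavour of Attoparsec
import Data.Either (rights)
import Data.Text (pack)

-- Run the parser on every suffix of the input and keep the successes.
findAll :: Parser a -> String -> [a]
findAll parser = rights . map (parseOnly parser . pack) . oneLess
  where
    -- All non-empty suffixes of the input, longest first.
    oneLess []           = []
    oneLess whole@(_:xs) = whole : oneLess xs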

This version works on String, but I think ByteString would be the better choice.

Searching for "abba" in "abbabba" should return only one match, ["abba"]: once the parser matches, the search should continue from the end of that match.
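To make the intended behaviour concrete, here is a minimal sketch of the non-overlapping search. It assumes the Text flavour of Attoparsec; the names `findAllMatches` and `example` are made up for illustration. At each position it tries the parser, and on failure it skips one character, all inside a single `parseOnly` run. It also assumes the parser consumes input whenever it succeeds, because otherwise `many` would loop forever. I have not measured whether this is actually fast.

```haskell
import Control.Applicative (many, (<|>))
import Data.Attoparsec.Text (Parser, anyChar, parseOnly, string)
import Data.Maybe (catMaybes)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical helper: collect all non-overlapping matches of a parser.
-- Assumption: the parser consumes at least one character when it succeeds.
findAllMatches :: Parser a -> Text -> [a]
findAllMatches p = either (const []) catMaybes . parseOnly (many step)
  where
    -- Either a match at the current position (the search then resumes
    -- right after it), or a single skipped character.
    step = (Just <$> p) <|> (Nothing <$ anyChar)

-- The example from the comments: "ababbaabba" contains two
-- non-overlapping occurrences of "abba".
example :: [Text]
example = findAllMatches (string (T.pack "abba")) (T.pack "ababbaabba")
```

With `string "abba"` as the parser, "abbabba" and "ababbabba" each give one match and "ababbaabba" gives two. The same shape should carry over to ByteString by importing from Data.Attoparsec.ByteString.Char8 instead.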

The_Ghost
  • Yes, `ByteString` or `Text` is almost always a better option than `String`, but it would be useful to know why your code is slow. Is memory filling up? Also, if you use the function `parseOnly` from the module [Data.Attoparsec.ByteString](https://hackage.haskell.org/package/attoparsec-0.12.1.1/docs/Data-Attoparsec-ByteString.html), your function becomes `findAll :: Parser a -> ByteString -> [a]` with few modifications. Use Pipes or Conduit if you want it to run in constant memory. – Sibi Aug 20 '14 at 18:35
  • To clarify, if you have a parser that parses `"abba"` and an input string `"ababbabba"` you'd like `findAll` to return `["abba", "abba"]`? – cdk Aug 20 '14 at 19:19
  • @cdk, ideally it should return only ["abba"], i.e. once a pattern matches, the search should continue after the whole match. – The_Ghost Aug 23 '14 at 10:56
  • It's not clear what you mean by "this whole match". – dfeuer Aug 23 '14 at 19:06
  • If I have a parser that parses "abba" and an input string "ababbabba", it should return ["abba"]. If the input string is "ababbaabba", it should return ["abba", "abba"]. – The_Ghost Aug 25 '14 at 22:08

0 Answers