How to make a custom Attoparsec parser combinator that returns a Vector instead of a list?

Question

{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text
import Control.Applicative(many)
import Data.Word

parseManyNumbers :: Parser [Int] -- I'd like many to return a Vector instead
parseManyNumbers = many (decimal <* skipSpace)

main :: IO ()
main = print $ parseOnly parseManyNumbers "131 45 68 214"

The above is just an example, but I need to parse a large amount of primitive values in Haskell and need to use arrays instead of lists. This is something that is possible in F#'s FParsec, so I've gone as far as looking at Attoparsec's source, but I can't figure out a way to do it. In fact, I can't figure out where many from Control.Applicative is defined in the base Haskell library. I thought it would be there, as that is where the documentation on Hackage points to, but no such luck.

Also, I am having trouble deciding what data structure to use here, as I can't find anything as convenient as a resizable array in Haskell, but I would rather not use inefficient tree-based structures.

An option to me would be to skip Attoparsec and implement an entire parser inside the ST monad, but I would rather avoid it except as a very last resort.

Over 10M of either floats or integers. Actually, it is a parser for multidimensional arrays for a [compiler project](http://futhark-lang.org//). The one it has now is by Happy and it is far too slow for the larger inputs, so I am hoping to replace it. My functional programming experience is in F# rather than Haskell so I am a bit taken aback here. The biggest hurdle here seems to be how to make use of a resizable array. — Marko Grdinić, Aug 05 '16 at 16:04
The default definition of __many__ may be found [here](http://hackage.haskell.org/package/base-4.9.0.0/docs/src/GHC.Base.html#line-701). Of course, Parsec has its own many combinator. — ErikR, Aug 05 '16 at 16:04
What kind of operations are you going to perform on the values? I presume you want a Vector just for the space economy - not for random access. What does your code do now when the values are returned as a list? — ErikR, Aug 05 '16 at 16:09
Convert them into the internal primitive type and pass them on to the later stages in the compiler. You are right that I want arrays for the space economy, but lists are slow in addition to that. I am not sure if the compiler converts the list of values to a Vector, if that is what you are asking. In general, the parser it currently has is quite inefficient and takes up 2.6 GB when loaded in GHCi. Until recently, we even suspected it to be related to a space leak in `ghc-mod`, but that turned out not to be the case. It might be worth redoing completely. — Marko Grdinić, Aug 05 '16 at 16:16
Not sure if it would be an improvement over simply `fmap`ing `Data.Vector.fromList`, but what about using `Data.Vector.Fusion.Bundle.Monadic.unfoldrM` and (not exposed, though) `Data.Vector.Generic.unstreamM`? It looks like there may be fusion that could avoid the intermediate list. — ryachza, Aug 05 '16 at 16:28
That sounds interesting. First time hearing about these functions. I was under the impression that fusion is done under the hood by the Haskell compiler. I should emphasize that I am not affiliated with the aforementioned compiler project and that my knowledge of Haskell is pretty rudimentary. For me, it might be worth looking at how `Vector`'s unfold is made under the hood since if that is at all done efficiently, it necessarily has to have array resizing mechanisms inside. — Marko Grdinić, Aug 05 '16 at 16:43
Though that might be barking up the wrong tree. Right now, I'd do better with an explanation of the `many` function. First I need to figure out how the parser is even getting applied. — Marko Grdinić, Aug 05 '16 at 16:49
@MarkoGrdinic My understanding of fusion is that it's sort of like rewrite rules - the compiler will apply it automatically but you have to indicate where it *can be* applied. The Vector library, because it aims to be performant, should be structured to make use of it. It appears to use "Bundle" and/or "Stream" to facilitate the fusion but I don't know in what scenarios they will be superior enough to warrant the complexity of using them directly. — ryachza, Aug 05 '16 at 18:38

score 2 · Answer 1 · answered Aug 05 '16 at 19:22

There is a growable vector implementation in Haskell, which is based on the great AMT (array mapped trie) algorithm: "persistent-vector". Unfortunately, the library isn't that well known in the community so far. However, to give you a clue about the performance of the algorithm, I'll say that it is the algorithm that drives the standard vector implementations in Scala and Clojure.

I suggest you implement your parser around that data structure, modelled on the list-specialized implementations. Here are those functions (the default methods of the Alternative class in base), btw:

-- | One or more.
some :: f a -> f [a]
some v = some_v
  where
    many_v = some_v <|> pure []
    some_v = (fmap (:) v) <*> many_v

-- | Zero or more.
many :: f a -> f [a]
many v = many_v
  where
    many_v = some_v <|> pure []
    some_v = (fmap (:) v) <*> many_v
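
One way to read this suggestion, as a minimal sketch: a many-like combinator that folds each parsed element into an accumulator instead of consing onto a list. The names foldMany and manySeq are made up for illustration, and Data.Sequence from containers is only a stand-in for persistent-vector, whose API is not shown here; any structure with an append-at-the-end operation could be slotted in the same way.

```haskell
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (optional)
import Data.Attoparsec.Text (Parser, decimal, skipSpace, parseOnly)
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

-- Hypothetical helper: run p zero or more times, folding the results
-- into an accumulator from left to right.
foldMany :: (b -> a -> b) -> b -> Parser a -> Parser b
foldMany f z p = go z
  where
    go !acc = do
      mx <- optional p            -- Nothing once p no longer matches
      case mx of
        Nothing -> pure acc
        Just x  -> go (f acc x)

-- Stand-in container: swap Seq.empty and (|>) for another structure's
-- empty value and snoc operation.
manySeq :: Parser a -> Parser (Seq a)
manySeq = foldMany (|>) Seq.empty

parseManyNumbers :: Parser (Seq Int)
parseManyNumbers = manySeq (decimal <* skipSpace)
```

Using optional keeps the fallback alternative scoped to a single run of p, so the loop itself is an ordinary monadic recursion. Like many, it will loop forever if p can succeed without consuming input.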

Benjamin Hodgson · Answer 2 · 2016-08-05T19:14:56.680

Vectors are arrays, under the hood. The tricky thing about arrays is that they are fixed-length. You pre-allocate an array of a certain length, and the only way of extending it is to copy the elements into a larger array.

This makes linked lists simply better at representing variable-length sequences. (It's also why list implementations in imperative languages amortise the cost of copying by allocating arrays with extra space and copying only when the space runs out.) If you don't know in advance how many elements there are going to be, your best bet is to use a list (and perhaps copy the list into a Vector afterwards using fromList, if you need to). That's why many returns a list: it runs the parser as many times as it can with no prior knowledge of how many that'll be.
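
For the "parse into a list, then copy" route, a minimal sketch might look like the following. It reuses the parser from the question; choosing Data.Vector.Unboxed over the boxed Data.Vector is my assumption, made because the elements are primitive.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (many)
import Data.Attoparsec.Text (Parser, decimal, skipSpace, parseOnly)
import qualified Data.Vector.Unboxed as VU

-- Parse into a list with many, then copy the list into an unboxed vector.
parseManyNumbers :: Parser (VU.Vector Int)
parseManyNumbers = VU.fromList <$> many (decimal <* skipSpace)
```

In GHCi this can be tried with parseOnly parseManyNumbers "131 45 68 214". The intermediate list is still built in full, so this helps with the final representation rather than with peak memory during parsing.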

On the other hand, if you happen to know how many numbers you're parsing, then a Vector could be more efficient. Perhaps you know a priori that there are always n numbers, or perhaps the protocol specifies before the start of the sequence how many numbers there'll be. Then you can use replicateM to allocate and populate the vector efficiently.
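
As a sketch of the known-length case, assume a made-up input format in which the element count comes first, followed by that many integers. The name countedNumbers and the format itself are hypothetical; replicateM is the one from Data.Vector.Unboxed (the boxed Data.Vector has the same function).

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (Parser, decimal, skipSpace, parseOnly)
import qualified Data.Vector.Unboxed as VU

-- Hypothetical format: "<count> <x1> <x2> ... <xcount>".
countedNumbers :: Parser (VU.Vector Int)
countedNumbers = do
  n <- decimal <* skipSpace               -- read the element count first
  VU.replicateM n (decimal <* skipSpace)  -- then run the element parser n times
```

For example, parseOnly countedNumbers "4 131 45 68 214" reads the 4 and then exactly four numbers; the parse fails if fewer are present.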

score 1 · Answer 3 · answered Aug 05 '16 at 17:31

Some ideas:

Data Structures

I think the most practical data structure to use for the list of Ints is something like [Vector Int]. If each component Vector is sufficiently long (e.g. has length 1k) you'll get good space economy. You'll have to write your own "list operations" to traverse it, but you'll avoid the re-copying you would have to perform to return the data in a single Vector Int.

Also consider using a Dequeue instead of a list.
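
A rough sketch of the [Vector Int] idea, with made-up helper names (upTo, chunked, chunkedNumbers) and a chunk size of 1024 picked arbitrarily. Each chunk is parsed into a short list and copied into an unboxed vector, so only one chunk's worth of list cells is needed at a time.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (optional)
import Data.Attoparsec.Text (Parser, decimal, skipSpace, parseOnly)
import qualified Data.Vector.Unboxed as VU

-- Hypothetical helper: run p at most n times, stopping early when it fails.
upTo :: Int -> Parser a -> Parser [a]
upTo n p
  | n <= 0    = pure []
  | otherwise = optional p >>= maybe (pure []) (\x -> (x :) <$> upTo (n - 1) p)

-- Collect the input as a list of vectors holding at most n elements each.
chunked :: VU.Unbox a => Int -> Parser a -> Parser [VU.Vector a]
chunked n p = do
  xs <- upTo n p
  if null xs
    then pure []                              -- nothing left to parse
    else (VU.fromList xs :) <$> chunked n p   -- freeze this chunk, continue

chunkedNumbers :: Parser [VU.Vector Int]
chunkedNumbers = chunked 1024 (decimal <* skipSpace)
```

The outer list has one cell per chunk rather than one per element, which is where the space saving would come from.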

Stateful Parsing

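Unlike Parsec, Attoparsec does not provide for user state. However, you might be able to make use of the runScanner function (shown here with its Data.Attoparsec.ByteString signature; Data.Attoparsec.Text has the same function over Char and Text):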

runScanner :: s -> (s -> Word8 -> Maybe s) -> Parser (ByteString, s)

(It also returns the parsed ByteString which in your case may be problematic since it will be very large. Perhaps you can write an alternate version which doesn't do this.)

Using unsafeFreeze and unsafeThaw you can incrementally fill in a Vector. Your s data structure might look something like:

data MyState = MyState 
             { inNumber   :: Bool           -- True if seen a digit
             , val        :: Int            -- value of int being parsed
             , vecs       :: [ Vector Int ] -- past parsed vectors 
             , v          :: Vector Int     -- current vector we are filling
             , vsize      :: Int            -- number of items filled in current vector
             }

Maybe instead of a [Vector Int] you use a Dequeue (Vector Int).

I imagine, however, that this approach will be slow since your parsing function will get called for every single character.
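
To show the shape of the scanner approach, here is a simplified sketch using the Text variant of runScanner. It is not the vector-filling design described above: the state (called ScanState here, a made-up name) keeps finished numbers in a plain reversed list, and negative numbers and overflow are ignored.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (Parser, runScanner, parseOnly)
import Data.Char (digitToInt, isDigit, isSpace)

-- Simplified scanner state (hypothetical; a list stands in for the vectors).
data ScanState = ScanState
  { inNumber :: !Bool   -- True if we are in the middle of a number
  , val      :: !Int    -- value of the number being parsed
  , done     :: [Int]   -- finished numbers, most recent first
  }

-- Called once per character; returning Nothing stops the scan.
step :: ScanState -> Char -> Maybe ScanState
step s c
  | isDigit c = Just s { inNumber = True, val = val s * 10 + digitToInt c }
  | isSpace c = Just (flush s)
  | otherwise = Nothing

-- Move a pending number, if there is one, onto the finished list.
flush :: ScanState -> ScanState
flush s
  | inNumber s = ScanState False 0 (val s : done s)
  | otherwise  = s

scanNumbers :: Parser [Int]
scanNumbers = (reverse . done . flush . snd) <$> runScanner (ScanState False 0 []) step
```

It can be tried with parseOnly scanNumbers "131 45 68 214". The step function is pure, so filling a mutable vector from inside it would need the unsafe operations mentioned above.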

Represent the list as a single token

Parsec can be used to parse a stream of tokens, so how about writing your own tokenizer and letting Parsec create the AST?

The key idea is to represent these large sequences of Ints as a single token. This gives you a lot more latitude in how you parse them.

Defer Conversion

Instead of converting the numbers to Ints at parse time, just have parseManyNumbers return a ByteString and defer the conversion until you actually need the values. This may enable you to avoid reifying the values as an actual list.
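
A minimal sketch of deferring the conversion, adapted to the Text-based parser from the question rather than ByteString (my substitution). The names rawNumbers and toVector are hypothetical; the conversion uses Data.Text.Read.decimal together with unfoldr from Data.Vector.Unboxed.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Attoparsec.Text as A
import Data.Char (isDigit, isSpace)
import qualified Data.Text as T
import qualified Data.Text.Read as TR
import qualified Data.Vector.Unboxed as VU

-- Capture the raw run of digits and whitespace without converting anything.
rawNumbers :: A.Parser T.Text
rawNumbers = A.takeWhile (\c -> isDigit c || isSpace c)

-- Convert later, only when the values are needed: repeatedly read one
-- decimal from the front of the text until no digits remain.
toVector :: T.Text -> VU.Vector Int
toVector = VU.unfoldr step
  where
    step t = case TR.decimal (T.stripStart t) of
      Right (n, rest) -> Just (n, rest)
      Left _          -> Nothing
```

For instance, fmap toVector (A.parseOnly rawNumbers "131 45 68 214") performs the conversion after parsing has finished. As far as I know, the pure unfoldr grows the vector as it goes rather than building a list first, but that is worth measuring on real input.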

Those aren't bad ideas. In regard to the latter two, I should have mentioned that the fast parser is for the value syntax. The approach I am going to go for here is to figure out how Attoparsec internals work, and once I figure out how to trigger the parser, just apply it repeatedly inside a Vector.unfoldr. — Marko Grdinić, Aug 05 '16 at 17:57
On the point of writing one's own list operations: if you define a `newtype` for `[Vector a]` (or just use [`Compose`](https://hackage.haskell.org/package/transformers-0.5.1.0/docs/Data-Functor-Compose.html)) the derived `Foldable` and `Traversable` instances will do the right thing — Benjamin Hodgson, Aug 05 '16 at 19:06
