1
{-# LANGUAGE ScopedTypeVariables,BangPatterns #-}

import qualified Data.Attoparsec.Internal as I
import qualified Data.Attoparsec.Internal.Types as T
import qualified Data.Vector.Unboxed as UVec
import qualified Data.Vector.Unboxed.Mutable as UMVec
import qualified Data.Vector as Vec
import qualified Data.Vector.Mutable as MVec
import qualified Data.Text as Text
import qualified System.IO.Unsafe as Unsafe

import Control.Monad.ST
import Control.Monad.Primitive

type Parser = T.Parser

manyCPSVec :: Parser Text.Text Char -> Parser Text.Text (Vec.Vector Char)
manyCPSVec parser = T.Parser $ \t pos more lose_fin win_fin ->
      let arr = Unsafe.unsafePerformIO (MVec.new 1024) in
      loop 0 arr t pos more lose_fin win_fin where
          loop i (arr :: MVec.MVector RealWorld Char) t pos more lose_fin win_fin =
              T.runParser parser t pos more lose win where
                  win t !pos more (a :: Char) =
                    Unsafe.unsafePerformIO (MVec.write arr i a) -- Here is the problem
                    loop (i+1) arr t pos more lose_fin win_fin
                  lose t pos more _ _ =
                      --x <- Vec.freeze arr
                      win_fin t pos more (Vec.empty)

main = print "Hello"

I am trying to put in some Vector functionality in Attoparsec for efficiency, but I've run into a wall.

If Attoparsec was not written using CPS, I could have used a functional unfoldr, but that is not an option here. The problem is that all calls have to be in the tail position here and no function returns in the standard sense of the thing therefore arrays have to be passed along as accumulators.

I tried to do this using the ST monad, but when that did not work I tried the above, but even so it fails to work. Haskell's type system is really killing me here. Is it in anyway possible to edit mutable arrays when programming with CPS?

If it is possible to do this within the ST monad I would doubly appreciate this.

Posts telling me to use data structures other than arrays though, will be downvoted.

Marko Grdinić
  • 3,798
  • 3
  • 18
  • 21
  • Just wondering, does this turn out to be any faster than collecting a list and using `Vector.fromList`? – Michael Aug 13 '16 at 17:11
  • I think you should be able to find a way to do this in `ST`. The challenge will be dealing efficiently with partial results, but I suspect there is no entirely efficient way to handle those. In particular, remember that a partial result may be resumed multiple times with different inputs, so too much `unsafePerformIO` could break your code. – dfeuer Aug 14 '16 at 00:48
  • @Michael Yeah, quite a bit. For parsing 10M integers, it makes a difference what data structure you use. Using [the completed parser](https://github.com/mrakgr/futhark/blob/parser_attempts/cps_parser_v5.hs) with boxed vectors it deals with 10M ints in 7.6s. With the unboxed parser it does it in 4s. Just using an extended functional unfoldr with a boxed Vector does it in 2s. `ResizeArray` in F# does it in 1.2s. The [persistent vector](https://hackage.haskell.org/package/persistent-vector) does it in 9.8. Apart from the F# solution, they all blow the heap on 100M. Lists do that long before 10M. – Marko Grdinić Aug 14 '16 at 09:13
  • @dfeuer To be honest, I've completely lost interest in this now. Literally nothing I've tried in Haskell lets me parse 100M ints without blowing the heap and I've been trying various approaches for 10 days now. I see it as a failure of the language. I ended up figuring out CPS and monads, but not how to do a task that would take me 10m in an imperative language. – Marko Grdinić Aug 14 '16 at 09:18
  • If you are really just parsing whitespace separated `Int`s, I think a parsing library might be too fancy. `Data.ByteString.Char8.readInt` is very fast. – Michael Aug 14 '16 at 11:17
  • It would be very helpful to see your ultimate goal and benchmark code. As Michael suggests, you may be attacking from the wrong angle. Attoparsec is designed to support streaming input and to be pure; these features may make it inappropriate for your purpose. – dfeuer Aug 14 '16 at 12:21
  • @Michael I was making the parser for a parsing library. If you take a look you will see that Attoparsec which is supposed to be fast has no array based parsers, so I thought I might do something about that. The speed is not so much an issue, but rather the fact that in the 100M case the memory gets eaten away way beyond the reasonable level. And even that is not the true problem - the real problem is the absence of control over memory in Haskell. – Marko Grdinić Aug 14 '16 at 12:28
  • @dfeuer My most [successful attempt](https://github.com/mrakgr/futhark/blob/parser_attempts/unfoldr_v6.hs) was here, so play with it if you want. It might be interesting if you can get it to perform on par with the [imperative F# solution](https://github.com/mrakgr/futhark/blob/parser_attempts/sum.fsx). Here is the [random integer generator](https://github.com/mrakgr/futhark/blob/parser_attempts/generate.fsx). When bechmarking the parser don't do something stupid like creating the data inside the program like the guy on Reddit whose name won't be mentioned, but do load it from file instead. – Marko Grdinić Aug 14 '16 at 12:36
  • @dfeuer My goal was to figure out how to parse a large number of values efficiently into an array and as a side goal preferably integrate that parser into some parser library. Now that I think about it, due to it using the CPS style maybe a nominally slower library that uses `Either` for control flow might have been a better fit for extension. I am not interested in this anymore. If you want to, take it on as a challenge to see if you can succeed where I failed. – Marko Grdinić Aug 14 '16 at 12:39

1 Answers1

0

A few minutes after I made this post, it occurred to me that I made a syntactic error.

manyCPSVec :: Parser Text.Text Char -> Parser Text.Text (Vec.Vector Char)
manyCPSVec parser = T.Parser $ \t pos more lose_fin win_fin ->
      let arr = Unsafe.unsafePerformIO (MVec.new 1024) in
      loop 0 arr t pos more lose_fin win_fin where
          loop i (arr :: MVec.MVector RealWorld Char) t pos more lose_fin win_fin =
              T.runParser parser t pos more lose win where
                  win t !pos more (a :: Char) =
                    Unsafe.unsafePerformIO $ do
                      MVec.write arr i a
                      return $ loop (i+1) arr t pos more lose_fin win_fin
                  lose t pos more _ _ =
                    Unsafe.unsafePerformIO $ do
                      x <- Vec.freeze arr
                      return $ win_fin t pos more x

The above type checks ok. I forgot that I can't sequence statements outside the do blocks in Haskell.

Marko Grdinić
  • 3,798
  • 3
  • 18
  • 21