3

I am trying to read a large vector of custom data type from a binary file. I tried to use the example given here.

The trouble with the example code is, it uses lists and I want to use vectors. So I adapted that code as below, but it takes very long time (more than a minute, I gave up after that) to read even 1 MB file.

module Main where

import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import qualified Data.Vector.Unboxed as Vec

main = do
  b <- BIN.readFile "dat.bin" -- about 1 MB size file
  let v = runGet getPairs (BIN.tail b) -- skip the first byte
  putStrLn $ show $ Vec.length v

getPair :: Get (Word8, Word8)
getPair = do
  price <- getWord8
  qty <- getWord8
  return (price, qty)

getPairs :: Get (Vec.Vector (Word8, Word8))
getPairs = do
 empty <- isEmpty
 if empty
   then return Vec.empty
   else do pair  <- getPair
           pairs <- getPairs
           return (Vec.cons pair pairs) -- is it slow because V.cons is O(n)?

When I tried to run it with ghc --make -O2 pairs.hs I got the error Stack space overflow: current size ...

How to efficiently read pairs of values from bytestring into vector?

Again, I wish to get complete working code not just only pointers to Haskell site or RWH nor a just list of function/module names.

mntk123
  • 905
  • 6
  • 18
  • 3
    you shouldn't _demand_ a complete working solution as answer. SO is a learning and problem understanding site not a crowd sourced coding service. – Erik Kaplun Sep 19 '15 at 13:30
  • 1
    @ErikAllik I am demanding complete working solution in the same spirit of Smallest Complete Working Example. Also I do give minimum working example. Another reason is the rarity of working examples on the Haskell sites. The problems I ask solutions for are generally very small snippets, hardly sufficient to accusing me of crowd sourcing code. The SO is great and Haskell docuemntation is poor is why I need to make that _demand_. I apologize if it has hurt/offended or caused undue distress to someone. – mntk123 Sep 19 '15 at 14:11
  • Yes, this is slow because `V.cons` is O(n). What did you expect? A million times a million is a lot! `vector` has perfectly good documentation. Note that the documentation for a Hackage module generally has a table of contents in the upper right. Check out the section on constructing vectors. – dfeuer Sep 19 '15 at 14:53
  • @dfeuer I expected it to be slow. But I didn't know any other method to construct the vector by incrementally adding elements to it as I go on reading values from the bytestring. As for why did I ask the question: I didn't understand any other method given on the hackage that can allow me to incrementally construct the vector as desired. – mntk123 Sep 19 '15 at 16:18
  • 1
    Well, you basically *can't* construct a vector by incrementally adding to it! They're more like C or C++ arrays or FORTRAN vectors than like the flexible things some other environments call vectors. You can get such flexible things in Haskell if you like; the traditional one is `Data.Sequence`, but Ed Kmett's been working on some faster ones lately. – dfeuer Sep 19 '15 at 16:47

2 Answers2

2

Here are a couple of examples of creating Vectors from files. They are not the most efficient, but both run in just a couple of seconds in ghci.

module Main where

import qualified Data.ByteString.Lazy as BIN
import qualified Data.ByteString as BS
import qualified Data.Vector.Unboxed as Vec
import System.IO
import System.Posix

getFileSize :: String -> IO Int
getFileSize path = do
    stat <- getFileStatus path
    return (fromEnum $ fileSize stat)

readVector1 path = do
  size <- getFileSize path
  withBinaryFile path ReadMode $ \h -> do
    -- can also use: size <- hFileSize h
    let go _ = do bs <- BS.hGet h 2 
                  return (BS.index bs 0, BS.index bs 1)
    Vec.generateM (div size 2) go

pairs (a:b:rest) = (a,b) : pairs rest
pairs _          = []

readVector2 path = do
  contents <- BIN.readFile path
  -- unfoldr :: Unbox a => (b -> Maybe (a, b)) -> b -> Vector a
  let v = Vec.unfoldr go (pairs $ BIN.unpack contents)
        where go [] = Nothing
              go (p:ps) = Just (p, ps)
  return v

main = do
  v <- readVector1 "rand" -- large file
  print $ Vec.length v
  v <- readVector2 "rand"
  print $ Vec.length v

A third alternative:

readVector3 path = do
  contents <- BS.readFile path
  let size = BS.length contents
      v = Vec.generate (div (fromIntegral size) 2) go
            where go i = let a = BS.index contents (2*i)
                             b = BS.index contents (2*i+1)
                         in (a,b)
  return v

This one turns out to be the fastest of the three.

ErikR
  • 51,541
  • 9
  • 73
  • 124
  • Note that using `size <- hFileSize h` might be a more portable way to get a file's size and doesn't require `System.Posix` – ErikR Sep 19 '15 at 13:24
  • the `readVector2` function uses lists and `readVector1` is too much work in IO. – mntk123 Sep 19 '15 at 13:31
  • Both run in less than a second when compiled on my machine. How long does it take on yours? – ErikR Sep 19 '15 at 13:35
  • @user5402: You changed quite a bit about the code - would you mind explaining what the most important change(s) was/were for improving the performance? (I'm curious) – Sam van Herwaarden Sep 19 '15 at 13:58
  • 1
    I think the main takeaway is that Vec.cons is probably inefficient. At first I though you should tell Vector the length you need - hence the first version. However, `unfold` also works pretty well. The main difference between `unfoldr` and `cons` is that `unfoldr` appends to a vector which is a more natural operation and thus I think more efficient. `cons`, however, I believe has to prepend to a vector which is not the natural way for vectors to grow. – ErikR Sep 19 '15 at 14:51
  • No. unfoldr just uses the traditional array doubling trick to achieve O(1) amortized time per `snoc` even though it may have to double the size of the array several times, copying the whole thing each time. – dfeuer Sep 19 '15 at 15:13
  • Yeah - but my point it that it seems to be a _lot_ better than using `cons`. – ErikR Sep 19 '15 at 15:21
  • 1
    Well yes, certainly it is! But the same technique could be applied to fill a vector from back to front instead. `cons` has to copy the whole vector each time. `unfoldr` saves itself some wiggle room so it can afford the occasional copy without blowing the asymptotic bound. Prepend vs. append has nothing to do with it. – dfeuer Sep 19 '15 at 16:40
  • @user5402 I liked and upvoted your answer but what bothers me is the IO part in the `readVector1`. If you give the code for reading vector that just takes ByteString (you can find its size) and then returns a vector filled in with the values I will accept it. I am struggling myself to get that type of code using your code but unable to remove the IO from it. – mntk123 Sep 19 '15 at 16:44
  • Ok - watch this space. – ErikR Sep 19 '15 at 17:02
  • Have a look at `readVector3`. Adjust the offset formulas `2*i`, `2*i+1` and the length determination `div size 2` to account for any header bytes you want to skip over. – ErikR Sep 19 '15 at 17:09
  • @user5402 I accept this answer. I cannot upvote it more than once. You have taught me many things thru the answer. – mntk123 Sep 19 '15 at 17:31
2

Here's an alternative approach for loading the vector, that uses pipes and pipes-bytestring to stream the file, and the vector function from foldl to create the vector:

{-# LANGUAGE PackageImports #-}
import Data.Functor (void)
import "pipes" Pipes
import qualified "pipes" Pipes.Prelude as P
import qualified "pipes-bytestring" Pipes.ByteString as B
import qualified "pipes-binary" Pipes.Binary as B
import qualified "vector" Data.Vector.Unboxed as V
import qualified "foldl" Control.Foldl as L
import "lens-family-core" Lens.Family (view)
import System.IO

main :: IO ()
main = do
    v <- withBinaryFile "somefile" ReadMode (\h ->
        -- for simplicity, errors are ignored with "void"
        L.impurely P.foldM L.vector (void (view B.decoded (B.drop 1 (B.fromHandle h)))))
    print (V.length (v::V.Vector (B.Word8,B.Word8)))

cons is inefficient. The approach taken by foldl's vector is to progressively double the vector's capacity using unsafeGrow, in order to accomodate incoming values, and at the end "trim" any excess capacity with unsafeTake.

danidiaz
  • 26,936
  • 4
  • 45
  • 95
  • It's an elegant solution - I really like it, but when I benchmark it takes about 2 secs to consume a 1 MB file whereas the reading two bytes at a time method takes about about 0.1 secs. Is that what you're seeing? – ErikR Sep 19 '15 at 16:57