
I want to process a couple of hundred binary data chunks ("scenarios") for a Monte Carlo simulation. Each scenario consists of 1 million floats. Here's how I create a dummy binary file for the scenario data:

import Data.Binary 
import qualified Data.ByteString.Lazy as B
import Data.Array.Unboxed

scenSize = 1000000
scens = 100

main = do
    let xs = array (1,scenSize) [(i, 0.0) | i <- [1..scenSize]] :: UArray Int Float
    let l = take scens $ Prelude.repeat xs
    B.writeFile "bintest.data" (encode l)
    return ()

This works fine. Now I want to process the scenarios. Since there can really be a lot of scenarios (scens = 1000 or so), the processing should be done lazily, one chunk at a time. I tried decodeFile, but this does not seem to work:

import Data.Binary 
import qualified Data.Array.IArray as IA
import Data.Array.Unboxed as A

main = do
    bs <- decodeFile "bintest.data" :: IO [UArray Int Float]
    mapM_ doStuff bs
    return ()

doStuff b = 
    Prelude.putStrLn $ show $ b IA.! 100000

This program seems to load all the data into memory first, and prints all the numbers only at the end of the run. It also uses a lot of memory and crashes for scens=500 on my 32-bit Ubuntu machine.

What am I doing wrong? Is there an easy way to make the program run lazily?

martingw

1 Answer


decodeFile is not lazy; just look at the source: it calls decodeOrFail, which must parse the whole file to determine success or failure.
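For intuition, here is a paraphrased sketch (hypothetical, not the actual library source) of why a decodeFile-style function cannot stream its result: the decode-or-fail step has to run the whole parser, here over the entire encoded list, before it can hand back any value.

import Data.Binary (Binary, decodeOrFail)
import qualified Data.ByteString.Lazy as L

-- Hypothetical paraphrase of a decodeFile-style function, for illustration
-- only; this is not the actual source of Data.Binary.decodeFile.
decodeFileLike :: Binary a => FilePath -> IO a
decodeFileLike f = do
    bs <- L.readFile f
    case decodeOrFail bs of
        -- Pattern matching on the Either forces the full parse: we cannot
        -- know whether decoding failed until the whole list has been read.
        Left  (_, _, err) -> error err
        Right (_, _, x)   -> return x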

EDIT:

So what I believe worked in the original binary is now broken (read: it's now a non-lazy memory hog). One solution, which I doubt is optimally pretty, is to use a lazy readFile with runGetIncremental and manually push chunks into the decoder:

import Data.Binary
import Data.Binary.Get
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString as B
import qualified Data.Array.IArray as IA
import Data.Array.Unboxed as A

main = do
    bs <- getListLazy `fmap` L.readFile "bintest2.data"
    mapM_ doStuff bs
    return ()

doStuff b = print $ b IA.! 100000

The important stuff is here:

getListLazy :: L.ByteString -> [UArray Int Float]
getListLazy lz = go decodeUArray (L.toChunks lz)
  where
    -- Walk the list of strict chunks, feeding them to the incremental
    -- decoder and emitting each array as soon as it is complete.
    go :: Decoder (UArray Int Float) -> [B.ByteString] -> [UArray Int Float]
    go _ []       = []
    go dec (b:bs) =
      case pushChunk dec b of
        -- One array decoded: emit it and restart a fresh decoder on the
        -- leftover bytes plus the remaining chunks.
        Done b' _ a -> a : go decodeUArray (b' : bs)
        -- The decoder needs more input: push the next chunk, or stop if
        -- the input is exhausted.
        Partial f   -> case bs of
                          (x:xs) -> go (f $ Just x) xs
                          []     -> []
        Fail _ _ s  -> error s -- alternatively use '[]'

    decodeUArray :: Decoder (UArray Int Float)
    decodeUArray = runGetIncremental get

Notice this solution didn't bother decoding the list length and plumbing it through the decoder - I just changed your generator code to write the arrays one after another instead of a single list of arrays.
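For completeness, a sketch of what such a modified generator could look like (my guess at it, not code from the question): each array's encoding is written back to back, so the file contains scens consecutive UArray encodings with no list-length prefix.

import Data.Binary
import qualified Data.ByteString.Lazy as B
import Data.Array.Unboxed

scenSize = 1000000
scens = 100

main :: IO ()
main = do
    let xs = array (1,scenSize) [(i, 0.0) | i <- [1..scenSize]] :: UArray Int Float
    -- Concatenate the encodings of the individual arrays rather than
    -- encoding one [UArray Int Float] value with a length prefix.
    B.writeFile "bintest2.data" (B.concat (replicate scens (encode xs)))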

To avoid code like this I think pipes would be the way to go.
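As a rough illustration of the shape that could take, here is a sketch using plain pipes that simply streams the arrays produced by getListLazy above through a pipeline (a proper version would use a streaming decoder, e.g. from the pipes-binary package, instead of reusing the hand-rolled one):

import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString.Lazy as L
import Data.Array.Unboxed (UArray)

-- Sketch only: a Producer that yields one decoded array at a time,
-- reusing getListLazy and doStuff as defined above.
scenarios :: FilePath -> Producer (UArray Int Float) IO ()
scenarios path = do
    lz <- lift (L.readFile path)
    each (getListLazy lz)

-- A main of this shape would replace the one above:
main :: IO ()
main = runEffect $ scenarios "bintest2.data" >-> P.mapM_ doStuff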

Thomas M. DuBuisson
  • Thanks! This seems to be a rather harsh limitation, so I'm wondering if there's an easy way around it. For the time being, I'll try the following: Writing: Instead of encoding and writing a list of UArrays, I'll encode and write one UArray at a time. Reading: I'll read a chunk with defaultChunkSize, see if it decodes, otherwise read a bit more, and doStuff every time I have a complete chunk. – martingw Aug 23 '13 at 16:24
  • 1
    I was thinking on this and was surprised to find `fmap decode (readFile f)` doesn't do it (it doesn't decode lists lazily). Now that I see this isn't trivially solved I'll give it another go in my free time. – Thomas M. DuBuisson Aug 23 '13 at 16:31
  • 1
    @martingw Ok, I've posted an ugly but operational version that keeps the memory down. Notice that at this point you'd probably be well served by learning and using pipes (`readFileP >-> runStateP [] decode >-> doStuffP` or some such) – Thomas M. DuBuisson Aug 24 '13 at 01:43
  • Wow, thanks for putting so much effort into this!! This is amazing! – martingw Aug 24 '13 at 07:49