2

I have a simulation library that uses the FFI wrapped in a monad M, carrying a context. All the foreign functions are pure, so I've decided to make the monad lazy, which is normally convenient for flow-control. I represent my simulation as a list of simulation-frames, that I can consume by either writing to a file, or by displaying the frame graphically.

simulation :: [(Frame -> M Frame)] -> Frame -> M [Frame] 
simulation [] frame = return [frame]
simulation (step:steps) frame 
   = step frame >>= fmap (frame:) . simulation steps 

Each frame consists of a tuple of newtype-wrapped ForeignPtrs that I can lift to my Haskell representation with

lift :: Frame -> M HFrame

Since the time-steps in my simulation are quite short, I only want to look at every n frames, for which I use

takeEvery n l = foldr cons nil l 0 where
    nil _ = []
    cons x rest 0 = x : rest n
    cons x rest n = rest (n-1)

So my code looks something like

main = consume 
     $ takeEvery n 
     $ runM 
     $ simulation steps initialFrame >>= mapM lift

Now, the problem is that as I increase n, a thunk builds up. I've tried a couple of different ways to try to strictly evaluate each frame in simulation, but I have yet to figure out how to do so. ForeignPtr doesn't appear to have a NFData instance, so I can't use deepseq, but all my attempts with seq, including using seq on each element in the tuple, have been without noticeable effect.

EDIT:

Upon request, I have included more specifics, that I initially excluded since I think they are probably mostly noise for this question.

The monad

newtype FT c a = FT (Context -> a)

instance Functor (FT c) where
    fmap f (FT a) = FT (f.a)

instance Applicative (FT c) where
    pure a = FT (\_ -> a)
    (<*>) (FT a) (FT b) = FT (\c -> a c $ b c)

instance Monad (FT c) where
    return = pure
    (>>=) (FT a) f = FT (\c -> (\(FT b) -> b c) $ f $ a c)

runFTIn :: Context -> (forall c. FT c a) -> a
runFTIn context (FT a) = a context


runFTWith :: [ContextOption] -> (forall c. FT c a) -> a
runFTWith options a
    = unsafePerformIO
    $ getContext options >>= \c -> return $ runFTIn c a
runFT = runFTWith []

unsafeLiftFromIO :: (Context -> IO a) -> FT c a
unsafeLiftFromIO a = FT (\c -> unsafePerformIO $ a c)

All the foreign functions are lifted from IO with unsafeLiftFromIO

newtype Box c = Box (ForeignPtr RawBox)
newtype Coordinates c = Coordinates (ForeignPtr RawCoordinates)
type Frame c = (Box c, Coordinates c)

liftBox :: Box c -> FT c HBox
liftCoordinates :: Coordinates c -> FT c HCoordinates
liftFrame (box, coordinates) = do
    box' <- liftBox box
    coordinates' <- liftCoordinates coordinates
    return (box', coordinates') 

The steps themselves are supposed to be arbitrary (Frame c -> FT c (Frame c)), so strictness should preferably be in the higher level code.

EDIT2:

I have now tried to use Streamly, however the problem persists, so I think the issue really is finding a way to strictly evaluate ForeignPtrs.

current implementations:

import Streamly
import qualified Streamly.Prelude as S
import qualified Streamly.Internal.Data.Stream.Serial as Serial

takeEvery n = Serial.unfoldrM ((fmap.fmap) (\(h, t) -> (h, S.drop (n-1) t)) . S.uncons)

(#) = flip ($)
simulation
    :: (IsStream t)
    => Frame c 
    -> t (FT c) (Frame c -> FT c (Frame c)) 
    -> t (FT c) (Frame c)
simulation frame = S.scanlM' (#) frame

EDIT3:

To clarify the symptoms and how I have diagnosed the problem.

The library calls OpenCL functions running on a GPU. I am sure that the freeing of the pointers is handled correctly - the ForeignPtrs have the correct freeing functions, and memory use is independent of total number of steps as long as this number is larger than n. What I find is that memory use on the GPU is basically linearly correlated to n. The consumer I've been using for this testing is

import qualified Data.ByteString.Lazy as BL
import Data.Binary
import Data.Binary.Put

writeTrajectory fn = fmap (BL.writeFile fn . runPut) . S.foldr ((>>).putFrame) (pure ()) . serially

For my streamly implementation, and

writeTrajectory fn = BL.writeFile fn . runPut . MapM_ putFrame

For the original implementation. Both should consume the stream continuously. I've generated the steps for testing with replicate.

I am unsure of how to more precisely analyze the memory-use on the GPU. System memory use is not an issue here.

Update: I am starting to think it's not a matter of strictness, but of GC-problems. The run-time system does not know the size of the memory allocated on the GPU and so does not know to collect the pointers, this is less of an issue when there is stuff going on CPU-side as well, as that will produce allocations too, activating the GC. This would explain the slightly non-determinstic memory usage, but linear correlation to n that I've seen. How too solve this nicely is another issue, but I suspect there will be a substantial overhaul to my code.

notBob
  • 143
  • 8
  • I suspect you're going to have to include some more of your code to get useful answers. What do `M` and its `Monad` and `MonadTrans` instances look like? What does `step` look like? Etc. – dfeuer Sep 05 '20 at 03:02
  • 1
    Fair enough. I've added some additional code. I wanted to avoid adding to much stuff that I don't think necessarily helps describe the problem. – notBob Sep 05 '20 at 06:25
  • 2
    Perhaps switching to a monad transformer expressly designed for streaming http://hackage.haskell.org/package/streaming could help. Something like `simulation :: [(Frame -> m Frame)] -> Frame -> Stream (Of Frame) m ()`. Each frame would be yielded individually http://hackage.haskell.org/package/streaming-0.2.3.0/docs/Streaming-Prelude.html#v:yield instead of being returned in a list at the end. – danidiaz Sep 05 '20 at 08:57
  • @notBob For streamly's version of `writeTrajectory`, perhaps we could try using a function like `S.mapM_` http://hackage.haskell.org/package/streamly-0.7.2/docs/Streamly-Internal-Data-Stream-StreamK.html#v:mapM_ instead of `S.foldr`. That `foldr` looks a bit fishy, it seems to be constructing a gigantic `IO` action that puts all the frames in one go. Also, I don't know what `runPut` does. – danidiaz Sep 05 '20 at 18:43
  • It does not appear to cause any issues for small `n`, and I have not yet found a way to use `S.mapM_` for that purpose. It got a bit trickier when dealing with the transformer, which I am not used to. I agree it looks weird though. – notBob Sep 05 '20 at 19:06

1 Answers1

1

I think the issue really is finding a way to strictly evaluate ForeignPtrs

If that is really the issue, one way to do that is to change the second clause of simulation:

{-# LANGUAGE BangPatterns #-}

simulation :: [(Frame -> M Frame)] -> Frame -> M [Frame] 
simulation [] frame = return [frame]
simulation (step:steps) frame@(!_, !_)  -- Evaluate both components of the pair
   = step frame >>= fmap (frame:) . simulation steps 
Li-yao Xia
  • 31,896
  • 2
  • 33
  • 56
  • I just tried it. It has no effect, and I think this is equivalent to things I've tried previously. I wonder what the problem could be... – notBob Sep 05 '20 at 17:26
  • It's probably useful to give more details about how the issue was diagnosed, to prevent any misunderstanding. For instance, are we really talking about thunks (i.e., some function being too lazy), or are we talking about resources acquired by the external library that are not being freed, or do we only know that space is being wasted? Two other missing pieces are the definitions of `consume` and `steps`. If `consume` forces the whole list before doing anything, well that would be a good reason to take space. The way the list `steps` is produced could also affect space usage. – Li-yao Xia Sep 05 '20 at 18:03