3

I'm writing a program that creates a shell script containing one command for each image file in a directory. There are 667,944 images in the directory, so I need to handle the strictness/laziness issue properly.

Here's a simple example that gives me Stack space overflow. It does work if I give it more space using +RTS -Ksize -RTS, but it should be able run with little memory, producing output immediately. So I've been reading the stuff about strictness in the Haskell wiki and the wikibook on Haskell, trying to figure out how to fix the problem, and I think it's one of the mapM commands that is giving me grief, but I still don't understand enough about strictness to sort the problem.

I've found some other questions on SO that seem relevant (Is mapM in Haskell strict? Why does this program get a stack overflow? and Is Haskell's mapM not lazy?), but enlightenment still eludes me.

import System.Environment (getArgs)
import System.Directory (getDirectoryContents)

genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
  let infile = indir ++ '/':file
  let angle = 0 -- have to actually read the file to calculate this for real
  let outfile = outdir ++ '/':file
  return $! "convert " ++ infile ++ " -rotate " ++ show angle ++ 
    " -crop 143x143+140+140 " ++ outfile

main :: IO ()
main = do
  putStrLn "#!/bin/sh"
  (indir:outdir:_) <- getArgs
  files <- getDirectoryContents indir
  let imageFiles = filter (`notElem` [".", ".."]) files
  commands <- mapM (genCommand indir outdir) imageFiles
  mapM_ putStrLn commands

EDIT: TEST #1

Here's the newest version of the example.

import System.Environment (getArgs)
import System.Directory (getDirectoryContents)
import Control.Monad ((>=>))

genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
  let infile = indir ++ '/':file
  let angle = 0 -- have to actually read the file to calculate this for real
  let outfile = outdir ++ '/':file
  return $! "convert " ++ infile ++ " -rotate " ++ show angle ++ 
    " -crop 143x143+140+140 " ++ outfile

main :: IO ()
main = do
  putStrLn "TEST 1"
  (indir:outdir:_) <- getArgs
  files <- getDirectoryContents indir
  putStrLn $ show (length files)
  let imageFiles = filter (`notElem` [".", ".."]) files
  -- mapM_ (genCommand indir outdir >=> putStrLn) imageFiles
  mapM_ (\filename -> genCommand indir outdir filename >>= putStrLn) imageFiles

I compile it with the command ghc --make -O2 amy2.hs -rtsopts. If I run it with the command ./amy2 ~/nosync/GalaxyZoo/table2/images/ wombat, I get

TEST 1
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

If I instead run it with the command ./amy2 ~/nosync/GalaxyZoo/table2/images/ wombat +RTS -K20M, I get the correct output...eventually:

TEST 1
667946
convert /home/amy/nosync/GalaxyZoo/table2/images//587736546846572812.jpeg -rotate 0 -crop 143x143+140+140 wombat/587736546846572812.jpeg
convert /home/amy/nosync/GalaxyZoo/table2/images//587736542558617814.jpeg -rotate 0 -crop 143x143+140+140 wombat/587736542558617814.jpeg

...and so on.

Community
  • 1
  • 1
mhwombat
  • 8,026
  • 28
  • 53
  • 1
    Is `genCommand` actually doing any I/O in the first place? Why use `mapM` if `map` would work? – C. A. McCann Apr 18 '13 at 15:20
  • In this example, genCommand isn't actually doing any IO. But in my real application, genCommand will read the file in order to compute the appropriate parameters to use in the command it generates. So I think I will need to use mapM. – mhwombat Apr 18 '13 at 15:23
  • Try replacing `mapM` with `safeMapM` from http://stackoverflow.com/questions/15546216/is-using-mapm-sequence-considered-good-practice – Mikhail Glushenkov Apr 18 '13 at 15:27
  • Another possible solution is to use some iteratee library (e.g. `pipes` or `conduit` to process list elements as they are generated). – Mikhail Glushenkov Apr 18 '13 at 15:30
  • I tried safeMapM, but I still get stack overflow. I'll try `pipes` or `conduits`. – mhwombat Apr 18 '13 at 15:32
  • 1
    You need to also stream the list of filenames or you will leak space. – Gabriella Gonzalez Apr 18 '13 at 16:50

1 Answers1

6

This isn't really a strictness issue(*), but an order of evaluation issue. Unlike lazily evaluated pure values, monadic effects must happen in deterministic order. mapM executes every action in the given list and gathers the results, but it cannot return until the whole list of actions is executed, so you don't get the same streaming behavior as with pure list functions.

The easy fix in this case is to run both genCommand and putStrLn inside the same mapM_. Note that mapM_ doesn't suffer from the same issue since it is not building an intermediate list.

mapM_ (genCommand indir outdir >=> putStrLn) imageFiles

The above uses the "kleisli composition operator" >=> from Control.Monad which is like the function composition operator . except for monadic functions. You can also use the normal bind and a lambda.

mapM_ (\filename -> genCommand indir outdir filename >>= putStrLn) imageFiles

For more complex I/O applications where you want better composability between small, monadic stream processors, you should use a library such as conduit or pipes.

Also, make sure you are compiling with either -O or -O2.

(*) To be exact, it is also a strictness issue, because in addition to building a large, intermediate list in memory, laziness causes mapM to build unnecessary thunks and use up stack.

EDIT: So it seems the main culprit might be getDirectoryContents. Looking at the function's source code, it essentially does the same kind of list accumulation internally as mapM.

In order to do streaming directory listing, we need to use System.Posix.Directory which unfortunately makes the program incompatible with non-POSIX systems (like Windows). You can stream the directory contents by e.g. using continuation passing style

import System.Environment (getArgs)
import Control.Monad ((>=>))

import System.Posix.Directory (openDirStream, readDirStream, closeDirStream)
import Control.Exception (bracket)

genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
  let infile = indir ++ '/':file
  let angle = 0 -- have to actually read the file to calculate this for real
  let outfile = outdir ++ '/':file
  return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
    " -crop 143x143+140+140 " ++ outfile

streamingDirContents :: FilePath -> (FilePath -> IO ()) -> IO ()
streamingDirContents root cont = do
    let loop stream = do
            fp <- readDirStream stream
            case fp of
                [] -> return ()
                _   | fp `notElem` [".", ".."] -> cont fp >> loop stream
                    | otherwise -> loop stream
    bracket (openDirStream root) loop closeDirStream


main :: IO ()
main = do
  putStrLn "TEST 1"
  (indir:outdir:_) <- getArgs
  streamingDirContents indir (genCommand indir outdir >=> putStrLn)

Here's how you could do the same thing using conduit:

import System.Environment (getArgs)

import System.Posix.Directory (openDirStream, readDirStream, closeDirStream)

import Data.Conduit
import qualified  Data.Conduit.List as L
import Control.Monad.IO.Class (liftIO, MonadIO)

genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
  let infile = indir ++ '/':file
  let angle = 0 -- have to actually read the file to calculate this for real
  let outfile = outdir ++ '/':file
  return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
    " -crop 143x143+140+140 " ++ outfile

dirSource :: (MonadResource m, MonadIO m) => FilePath -> Source m FilePath
dirSource root = do
    bracketP (openDirStream root) closeDirStream $ \stream -> do
        let loop = do
                fp <- liftIO $ readDirStream stream
                case fp of
                    [] -> return ()
                    _  -> yield fp >> loop
        loop

main :: IO ()
main = do
    putStrLn "TEST 1"
    (indir:outdir:_) <- getArgs
    let files    = dirSource indir $= L.filter (`notElem` [".", ".."])
        commands = files $= L.mapM (liftIO . genCommand indir outdir)

    runResourceT $ commands $$ L.mapM_ (liftIO . putStrLn)

The nice thing about conduit is that you regain the ability to compose pieces of functionality with things like conduit versions of filter and mapM. The $= operator streams stuff forward in the chain and $$ connects the stream to a consumer.

The not-so-nice thing is that real world is complicated and writing efficient and robust code requires us to jump through some hoops with resource management. That's why all the operations work in the ResourceT monad transformer which keeps track of e.g. open file handles and cleans them up promptly and deterministically when they are no longer needed or e.g. if the computation gets aborted by an exception (this is in contrast to using lazy I/O and relying on the garbage collector to eventually release any scarce resources).

However, this means that we a) need to run the final resulting conduit operation with runResourceT and b) we need to explicitly lift I/O operations to the transformed monad using liftIO instead of being able to directly write e.g. L.mapM_ putStrLn.

shang
  • 24,642
  • 3
  • 58
  • 86
  • Unfortunately, neither of those fixes made the stack overflow go away. (I am using -O, and I tried -O2 as well.) But your explanation really helped me understand the problem better, and it was cool to learn about the `>=>` operator. – mhwombat Apr 18 '13 at 15:42
  • Hmm, I don't see any reason why this would overflow the stack with `mapM_`. Are you running the exact code that you posted or something else? How many files do you get from `getDirectoryContents`? – shang Apr 18 '13 at 15:59
  • I was running the exact code shown, although subsequently I've modified it to try out the changes suggested here. I've edited the question to show the current version. `getDirectoryContents` gives me 667946 files, which is a lot. – mhwombat Apr 18 '13 at 16:27
  • Two of the "files are" `.` and `..`, but they are filtered out, so only 667944 files are processed. – mhwombat Apr 18 '13 at 16:36
  • I updated the answer with streaming solutions to `getDirectoryContents` – shang Apr 18 '13 at 18:06
  • @shang Your initial solution works fine for me on Linux. Perhaps it's a Windows-specific problem? – Mikhail Glushenkov Apr 18 '13 at 18:10
  • @Mikhail I'm on Linux too. The problem only occurs when you have a huge number of files. – mhwombat Apr 18 '13 at 18:18