
Setting

I need to traverse a directory of 100+ .txt files, open each one, apply some function to it, and then combine the results. These files are huge, on the order of 10GB. Some typical operations in pseudocode might be:

foldr concatFile mempty $ openFile <$> [filePath1, ..., filePathn]
foldr countStuff 0      $ openFile <$> [filePath1, ..., filePathn]

The trick is to make sure the files never all exist in memory at the same time; my previous naive solution created all kinds of swap files on my Mac. In addition, if one of the file paths is invalid, I'd like to just skip it and go on with the program.

My Solution

Currently I'm using conduit and would like a conduit-based solution if possible, but if it's not the right tool, I'm fine with using something else.

xiaolingxiao

1 Answer


You can nest conduit execution like this:

{-# LANGUAGE OverloadedStrings #-}

import Conduit
import qualified Data.ByteString as BS

-- Process a single file: stream its contents in chunks and sum the
-- chunk lengths, printing the file's size in bytes.
processFile :: FilePath -> IO ()
processFile path =
    runConduitRes (sourceFile path .| mapC BS.length .| sumC) >>= print

-- Run processFile on every file in a directory tree.
doit :: FilePath -> IO ()
doit top =
    runConduitRes $ sourceDirectoryDeep False top .| mapM_C (liftIO . processFile)

My understanding is that the sourceFile producer streams the contents of a file in chunks, so a whole file never has to sit in memory at once. Replace processFile with whatever you want to do, including ignoring the file entirely.
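The question also asks to skip invalid paths; a minimal sketch of that (processFileSafe is an illustrative name, not part of the code above, and this assumes opening a bad path raises an IOException) uses Control.Exception.handle:

import Control.Exception (IOException, handle)

-- Sketch: like processFile, but an unreadable or missing file is
-- logged and skipped instead of aborting the whole run.
processFileSafe :: FilePath -> IO ()
processFileSafe path =
    handle (\e -> putStrLn ("skipping " ++ path ++ ": " ++ show (e :: IOException))) $
        runConduitRes (sourceFile path .| mapC BS.length .| sumC) >>= print

Plug processFileSafe into doit in place of processFile and the traversal will keep going past bad paths.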

And, according to this Yesod article, sourceDirectoryDeep should efficiently traverse a directory structure.

The thing you apparently can't do with sourceDirectoryDeep is prune directories; it always descends into every subdirectory.
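You can, however, filter the paths it emits. A sketch along these lines (doitTxt is an illustrative name) restricts processing to the .txt files from the question:

import System.FilePath (takeExtension)

-- Sketch: same traversal, but only paths ending in .txt reach processFile.
doitTxt :: FilePath -> IO ()
doitTxt top = runConduitRes $
       sourceDirectoryDeep False top
    .| filterC ((== ".txt") . takeExtension)
    .| mapM_C (liftIO . processFile)

filterC acts on the stream of emitted paths, so the traversal itself is unchanged; only which files get processed.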

ErikR
  • Unless the directory has a truly tremendous number of files, a fancy machine sounds like overkill. How about just reading the directory, then using `foldM` to process each, combining results along the way? – dfeuer Aug 13 '16 at 22:26
  • Don't know if the OP needs it, but `sourceDirectoryDeep` does a recursive traversal. But yes, the larger efficiency will come from processing the contents of each file in a chunked fashion. – ErikR Aug 13 '16 at 22:33
  • @dfeuer conduit (or pipes) `foldM` is *not even slightly* more complicated than `Control.Monad.foldM`. The amazing number of catastrophes that can occur by working with a list of Filepaths lazily developed from a directory traversal was one of the original poster children for streaming io. It *just isn't simpler* and shouldn't be recommended. `import Conduit` is shorter than `import Control.Monad`, and once you type it, you have both the directory tree and the correct fold function to apply to it. – Michael Aug 14 '16 at 02:34
  • tip for Haskell newbies: if you are basically using `readFile` as `processFile` in the above example, be careful which `readFile` you are using as there are many possible output types for different `readFile`s, e.g.: `String`, `Text`, `ByteString` which might cause slightly confusing type errors if you use the wrong one. – bbarker Jan 15 '19 at 16:14
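Following up on the `foldM` discussion in the comments above: conduit's foldMC fills the same role, so the "combine the results" part of the question can be a single streaming fold. A rough sketch (totalBytes and fileSize are illustrative names):

-- Sketch: fold over the discovered files, combining per-file results
-- as they arrive; only the running total is kept in memory.
totalBytes :: FilePath -> IO Int
totalBytes top = runConduitRes $
       sourceDirectoryDeep False top
    .| foldMC (\acc path -> (acc +) <$> liftIO (fileSize path)) 0
  where
    fileSize p = runConduitRes (sourceFile p .| mapC BS.length .| sumC)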