
Idea. Read several files line by line, concatenate them, and process the resulting list of lines from all the files.

Implementation. It can be implemented like this:

import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B

-- Read every file and concatenate the contents into one strict ByteString.
readFiles :: [FilePath] -> IO B.ByteString
readFiles = fmap B.concat . mapM B.readFile

...

main = do
    files <- getArgs
    allLines <- readFiles files

Problem. This works unbearably slowly. Notably, the real and user times are several orders of magnitude higher than the system time (measured using the UNIX `time` utility), so I suppose the problem is that too much time is spent in IO. I haven't managed to find a simple and effective way to solve this problem in Haskell.

For instance, processing two files (30,000 lines and 1.2 MB each) takes

   20.98 real        18.52 user         0.25 sys

This is the output when running with `+RTS -s`:

     157,972,000 bytes allocated in the heap
       6,153,848 bytes copied during GC
       5,716,824 bytes maximum residency (4 sample(s))
       1,740,768 bytes maximum slop
              10 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       295 colls,     0 par    0.01s    0.01s     0.0000s    0.0006s
  Gen  1         4 colls,     0 par    0.00s    0.00s     0.0010s    0.0019s

  INIT    time    0.00s  (  0.01s elapsed)
  MUT     time   16.09s  ( 16.38s elapsed)
  GC      time    0.01s  (  0.02s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   16.11s  ( 16.41s elapsed)

  %GC     time       0.1%  (0.1% elapsed)

  Alloc rate    9,815,312 bytes per MUT second

  Productivity  99.9% of total user, 98.1% of total elapsed

       16.41 real        16.10 user         0.12 sys

Why is concatenating files using the code above so slow? How should I write the `readFiles` function in Haskell to make it faster?

Daniel
  • Could you add a well-formulated question with some evidence? Like "Why is file concatenation with ByteStrings so slow?" – Zeta Nov 11 '15 at 11:12
  • @Zeta I updated the text with some time measurements and questions. – Daniel Nov 11 '15 at 11:32
  • What does the program do with the lines *after* it concatenates them? – MathematicalOrchid Nov 11 '15 at 11:37
  • @MathematicalOrchid Some trivial string processing like splitting the string and extracting several fields. This processing takes little time, as indicated by `0.25 sys` in the `time` output above. – Daniel Nov 11 '15 at 11:43
  • @MathematicalOrchid Sure, I've tried to remove any string processing steps and just parse those files using the code above, and the problem is still here. – Daniel Nov 11 '15 at 11:48
  • Have a go at running the program with the RTS `-s` option and tell us what you see. – MathematicalOrchid Nov 11 '15 at 11:59
  • Nothing obviously wrong. See [this](http://book.realworldhaskell.org/read/profiling-and-optimization.html). – PyRulez Nov 11 '15 at 12:13
  • @MathematicalOrchid I have updated my question. – Daniel Nov 11 '15 at 12:22
  • @PyRulez Could you elaborate and/or provide a solution? – Daniel Nov 11 '15 at 12:24
  • @Daniel Thanks. OK, so your stats say that your program is using a sane amount of RAM, and GC time isn't what's killing you. This is good. At this point, I don't think the bytestring concatenation is what's taking so long. But without seeing the rest of your program, I can't really say for sure... The next step is probably to read the GHC manual and turn on some profiling options to see which function(s) are taking too long. – MathematicalOrchid Nov 11 '15 at 12:27
  • I think the issue lies in `B.concat`. If you can, avoid it and instead work on a list of ByteStrings, or use Data.ByteString.Builder (see the sketch after these comments). – dfordivam Nov 11 '15 at 13:33
  • @dfordivam Thanks for a nice idea. Removing `B.concat` resulted in a small speed-up but didn't help much. – Daniel Nov 11 '15 at 15:22
  • Just to make sure: are you compiling this with optimizations on? – epsilonhalbe Nov 11 '15 at 16:24
  • Please include precisely how you compiled and executed your code, and a minimum working example. This really isn't reproducible. – user2407038 Nov 11 '15 at 16:51
  • @epsilonhalbe Thanks for your comment. The optimizations are on (`-O2`). – Daniel Nov 11 '15 at 17:12
  • @user2407038 Thank you for commenting. You're right, I should have provided these details. It seems, though, that I've managed to find what was causing the slowness in my case. (It was `Data.List.nub`.) – Daniel Nov 11 '15 at 17:38
  • @Daniel Indeed, the process of breaking down your bug into a minimum working example will not only help other people with your bug, but in probably 90% of cases it will lead you to discover what the bug is yourself. – user2407038 Nov 11 '15 at 17:47
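
For anyone who wants to try dfordivam's Builder suggestion: here is a minimal sketch (illustrative only, not code from the question; the name `readFilesBuilder` is made up) that concatenates file contents through Data.ByteString.Builder, yielding a chunked lazy ByteString instead of the single contiguous strict buffer that `B.concat` allocates:

import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL
import System.Environment (getArgs)

-- Glue the file contents together as a Builder, then materialize them
-- as a lazy (chunked) ByteString, avoiding one big strict copy.
readFilesBuilder :: [FilePath] -> IO BL.ByteString
readFilesBuilder paths = do
    chunks <- mapM B.readFile paths
    return $ BB.toLazyByteString (foldMap BB.byteString chunks)

main :: IO ()
main = do
    files <- getArgs
    allLines <- readFilesBuilder files
    print (BL.length allLines)   -- force the result so the work actually happens

As the comment thread notes, avoiding `B.concat` bought the asker only a small speed-up, since the real cost turned out to be elsewhere.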

1 Answer


You should show us exactly what your processing steps are.

This program is very performant even when run on multiple input files of the kind you are using (1.2 MB, 30k lines each):

import Data.List (foldl')
import System.Environment
import qualified Data.ByteString.Char8 as B

-- Read every file and concatenate the contents into one strict ByteString.
readFiles :: [FilePath] -> IO B.ByteString
readFiles = fmap B.concat . mapM B.readFile

main = do
    files <- getArgs
    allLines <- readFiles files
    -- Count the words with a strict left fold, forcing the whole input.
    print $ foldl' (\s _ -> s+1) 0 (B.words allLines)

Here is how I created the input file:

import Control.Monad

-- Emit 30,000 numbered lines of filler text (roughly 1.2 MB in total).
main = do
  forM_ [1..30000] $ \i ->
    putStrLn $ unwords ["line", show i, "this is a test of the emergency"]
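
(The output would then be redirected to a file, e.g. `./gen > input`; the program name `gen` is illustrative.)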

Run times:

time ./program input               -- 27 milliseconds
time ./program input input         -- 49 milliseconds
time ./program input input input   -- 69 milliseconds
ErikR
  • Thanks a lot! Your answer helped me find the place in my processing steps that resulted in slow execution. It was actually `nub` that was really slow on the lines of the input files. The rest is indeed very fast. – Daniel Nov 11 '15 at 17:39
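
A closing note for readers who hit the same wall: `Data.List.nub` is O(n²), because with only an `Eq` constraint it must compare each element against every distinct element seen so far. When the element type is in `Ord` (as ByteString is), a set-backed, order-preserving variant runs in O(n log n). A minimal sketch, where `ordNub` is an illustrative name rather than anything from the original code:

import qualified Data.Set as Set

-- Order-preserving deduplication in O(n log n), unlike the O(n^2)
-- Data.List.nub, which can only use (==) under its Eq constraint.
ordNub :: Ord a => [a] -> [a]
ordNub = go Set.empty
  where
    go _    []     = []
    go seen (x:xs)
      | x `Set.member` seen = go seen xs
      | otherwise           = x : go (Set.insert x seen) xs

Swapping something like this in for `nub` on 30,000 lines replaces roughly 450 million pairwise comparisons with about half a million set operations, which is consistent with the multi-second run times reported in the question.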