
Idea. Read several files line by line, concatenate them, and process the resulting list of lines from all the files.

Implementation. It can be implemented like this:

import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B

-- Read every file and concatenate the contents into one strict ByteString.
readFiles :: [FilePath] -> IO B.ByteString
readFiles = fmap B.concat . mapM B.readFile

...

main = do
    files <- getArgs
    allLines <- readFiles files

Problem. This works unbearably slowly. Notably, the real and user times are several orders of magnitude higher than the system time (measured using the UNIX `time` utility), so I suppose the problem is that too much time is spent in IO. I haven't managed to find a simple and effective way to solve this problem in Haskell.

For instance, processing two files (30,000 lines and 1.2 MB each) takes

   20.98 real        18.52 user         0.25 sys

This is the output when running with `+RTS -s`:

     157,972,000 bytes allocated in the heap
       6,153,848 bytes copied during GC
       5,716,824 bytes maximum residency (4 sample(s))
       1,740,768 bytes maximum slop
              10 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       295 colls,     0 par    0.01s    0.01s     0.0000s    0.0006s
  Gen  1         4 colls,     0 par    0.00s    0.00s     0.0010s    0.0019s

  INIT    time    0.00s  (  0.01s elapsed)
  MUT     time   16.09s  ( 16.38s elapsed)
  GC      time    0.01s  (  0.02s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   16.11s  ( 16.41s elapsed)

  %GC     time       0.1%  (0.1% elapsed)

  Alloc rate    9,815,312 bytes per MUT second

  Productivity  99.9% of total user, 98.1% of total elapsed

       16.41 real        16.10 user         0.12 sys

Why is concatenating files using the code above so slow? How should I write the `readFiles` function in Haskell to make it faster?

Daniel
  • Could you add a well-formulated question with some evidence? Like "Why is file concatenation with ByteStrings so slow?" – Zeta Nov 11 '15 at 11:12
  • @Zeta I updated the text with some time measurements and questions. – Daniel Nov 11 '15 at 11:32
  • What does the program do with the lines *after* it concatenates them? – MathematicalOrchid Nov 11 '15 at 11:37
  • @MathematicalOrchid Some trivial string processing like splitting the string and extracting several fields. This processing takes little time, as indicated by `0.25 sys` in the `time` output above. – Daniel Nov 11 '15 at 11:43
  • @MathematicalOrchid Sure, I've tried to remove any string processing steps and just parse those files using the code above, and the problem is still here. – Daniel Nov 11 '15 at 11:48
  • Have a go at running the program with the RTS `-s` option and tell us what you see. – MathematicalOrchid Nov 11 '15 at 11:59
  • Nothing obviously wrong. See [this](http://book.realworldhaskell.org/read/profiling-and-optimization.html). – PyRulez Nov 11 '15 at 12:13
  • @MathematicalOrchid I have updated my question. – Daniel Nov 11 '15 at 12:22
  • @PyRulez Could you elaborate and/or provide a solution? – Daniel Nov 11 '15 at 12:24
  • @Daniel Thanks. OK, so your stats say that your program is using a sane amount of RAM, and GC time isn't what's killing you. This is good. At this point, I don't think the bytestring concatenation is what's taking so long. But without seeing the rest of your program, I can't really say for sure... The next step is probably to read the GHC manual and turn on some profiling options to see which function(s) are taking too long. – MathematicalOrchid Nov 11 '15 at 12:27
  • I think the issue lies in `B.concat`. If you can, avoid it and instead work on a list of ByteStrings, or use Data.ByteString.Builder (see the sketch after these comments). – dfordivam Nov 11 '15 at 13:33
  • @dfordivam Thanks for a nice idea. Removing `B.concat` resulted in a small speed-up but didn't help much. – Daniel Nov 11 '15 at 15:22
  • Just to make sure: are you compiling this with optimizations on? – epsilonhalbe Nov 11 '15 at 16:24
  • Please include precisely how you compiled and executed your code, and a minimum working example. This really isn't reproducible. – user2407038 Nov 11 '15 at 16:51
  • @epsilonhalbe Thanks for your comment. The optimizations are on (`-O2`). – Daniel Nov 11 '15 at 17:12
  • @user2407038 Thank you for commenting. You're right, I should have provided these details. It seems, though, that I've managed to find what was causing the slowness in my case. (It was `Data.List.nub`.) – Daniel Nov 11 '15 at 17:38
  • @Daniel Indeed, the process of breaking down your bug into a minimum working example will not only help other people with your bug, but in probably 90% of cases it will lead you to discover what the bug is yourself. – user2407038 Nov 11 '15 at 17:47
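
For anyone who wants to try dfordivam's Builder suggestion: here is a minimal sketch (illustrative only, not code from the question; the name `readFilesBuilder` is made up) that concatenates file contents through Data.ByteString.Builder, yielding a chunked lazy ByteString instead of the single contiguous strict buffer that `B.concat` allocates:

import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL
import System.Environment (getArgs)

-- Glue the file contents together as a Builder, then materialize them
-- as a lazy (chunked) ByteString, avoiding one big strict copy.
readFilesBuilder :: [FilePath] -> IO BL.ByteString
readFilesBuilder paths = do
    chunks <- mapM B.readFile paths
    return $ BB.toLazyByteString (foldMap BB.byteString chunks)

main :: IO ()
main = do
    files <- getArgs
    allLines <- readFilesBuilder files
    print (BL.length allLines)   -- force the result so the work actually happens

As the comment thread notes, avoiding `B.concat` bought the asker only a small speed-up, since the real cost turned out to be elsewhere.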

1 Answer


You should show us exactly what your processing steps are.

This program is very performant even when run on multiple input files of the kind you are using (1.2 MB, 30k lines each):

import Data.List (foldl')
import System.Environment
import qualified Data.ByteString.Char8 as B

-- Read every file and concatenate the contents into one strict ByteString.
readFiles :: [FilePath] -> IO B.ByteString
readFiles = fmap B.concat . mapM B.readFile

main = do
    files <- getArgs
    allLines <- readFiles files
    -- Count the words with a strict left fold, forcing the whole input.
    print $ foldl' (\s _ -> s+1) 0 (B.words allLines)

Here is how I created the input file:

import Control.Monad

-- Emit 30,000 numbered lines of filler text (roughly 1.2 MB in total).
main = do
  forM_ [1..30000] $ \i ->
    putStrLn $ unwords ["line", show i, "this is a test of the emergency"]
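
(The output would then be redirected to a file, e.g. `./gen > input`; the program name `gen` is illustrative.)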

Run times:

time ./program input               -- 27 milliseconds
time ./program input input         -- 49 milliseconds
time ./program input input input   -- 69 milliseconds
ErikR
  • Thanks a lot! Your answer helped me find the place in my processing steps that resulted in slow execution. It was actually `nub` that was really slow on the lines of the input files. The rest is indeed very fast. – Daniel Nov 11 '15 at 17:39
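
A closing note for readers who hit the same wall: `Data.List.nub` is O(n²), because with only an `Eq` constraint it must compare each element against every distinct element seen so far. When the element type is in `Ord` (as ByteString is), a set-backed, order-preserving variant runs in O(n log n). A minimal sketch, where `ordNub` is an illustrative name rather than anything from the original code:

import qualified Data.Set as Set

-- Order-preserving deduplication in O(n log n), unlike the O(n^2)
-- Data.List.nub, which can only use (==) under its Eq constraint.
ordNub :: Ord a => [a] -> [a]
ordNub = go Set.empty
  where
    go _    []     = []
    go seen (x:xs)
      | x `Set.member` seen = go seen xs
      | otherwise           = x : go (Set.insert x seen) xs

Swapping something like this in for `nub` on 30,000 lines replaces roughly 450 million pairwise comparisons with about half a million set operations, which is consistent with the multi-second run times reported in the question.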