Idea. Read several files line by line, concatenate them, and process the combined list of lines.
Implementation. One way to implement this:
import qualified Data.ByteString.Char8 as B
import System.Environment (getArgs)

readFiles :: [FilePath] -> IO B.ByteString
readFiles = fmap B.concat . mapM B.readFile
...
main = do
  files <- getArgs
  allLines <- readFiles files
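For reference, here is the complete version I benchmark; the elided processing is replaced with a plain line count (my stand-in, not the real processing), just so allLines is actually demanded:

import qualified Data.ByteString.Char8 as B
import System.Environment (getArgs)

readFiles :: [FilePath] -> IO B.ByteString
readFiles = fmap B.concat . mapM B.readFile

main :: IO ()
main = do
  files <- getArgs
  allLines <- readFiles files
  -- Stand-in for the real processing: split into lines and count them,
  -- which forces the whole concatenated ByteString.
  print (length (B.lines allLines))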
Problem. This works unbearably slowly. Notably, the real and user times are several orders of magnitude higher than the system time (measured with the UNIX time utility), so I suppose the problem is that too much time is spent in IO.
I haven't managed to find a simple and efficient way to solve this problem in Haskell.
For instance, processing two files (30,000 lines and 1.2 MB each) takes:
20.98 real 18.52 user 0.25 sys
This is the output when running with +RTS -s:
     157,972,000 bytes allocated in the heap
       6,153,848 bytes copied during GC
       5,716,824 bytes maximum residency (4 sample(s))
       1,740,768 bytes maximum slop
              10 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       295 colls,     0 par    0.01s    0.01s     0.0000s    0.0006s
  Gen  1         4 colls,     0 par    0.00s    0.00s     0.0010s    0.0019s

  INIT    time    0.00s  (  0.01s elapsed)
  MUT     time   16.09s  ( 16.38s elapsed)
  GC      time    0.01s  (  0.02s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   16.11s  ( 16.41s elapsed)

  %GC     time       0.1%  (0.1% elapsed)

  Alloc rate    9,815,312 bytes per MUT second

  Productivity  99.9% of total user, 98.1% of total elapsed
16.41 real 16.10 user 0.12 sys
Why is concatenating files using the code above so slow?
How should I write the readFiles function in Haskell to make it faster?
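For instance, would switching to lazy ByteStrings help? A minimal sketch of that variant (my guess at an alternative, not something I have confirmed to be faster):

import qualified Data.ByteString.Lazy.Char8 as BL

-- Same shape as readFiles, but lazy: each file is read in chunks
-- on demand rather than loaded strictly up front.
-- Note: lazy readFile keeps each file handle open until its
-- contents are fully consumed.
readFilesLazy :: [FilePath] -> IO BL.ByteString
readFilesLazy = fmap BL.concat . mapM BL.readFile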