
I have the following Haskell code, implementing a simple version of the "cat" Unix command-line utility. Testing performance with "time" on a 400MB file, it's about 3x slower than cat. (The exact script I am using to test it is below the code.)

My questions are:

  1. Is this a valid test of performance?
  2. How can I make this program run faster?
  3. How can I identify performance bottlenecks in Haskell programs in general?

Regarding questions 2 and 3: I have compiled with GHC's -prof flag and run with +RTS -p, but I'm finding the output a bit uninformative here.

Source (Main.hs)

module Main where

import System.IO
import System.Environment
import Data.ByteString as BS

import Control.Monad

-- Copied from cat source code
bufsize = 1024*128

go handle buf = do
  hPut stdout buf
  eof <- hIsEOF handle
  unless eof $ do
    buf <- hGetSome handle bufsize
    go handle buf

main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- hGetSome handle bufsize
  hSetBuffering stdin $ BlockBuffering (Just bufsize)
  hSetBuffering stdout $ BlockBuffering (Just bufsize)
  go handle buf

Timing script (run.sh):

#!/usr/bin/env bash

# Generate 10M lines of silly test data
yes aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | head -n 10000000 > huge

# Compile with optimisation
ghc -O2 Main.hs

# Run haskell
echo "timing Haskell"
time ./Main huge > /dev/null

echo ""
echo ""

# Run cat
echo "timing 'cat'"
time cat huge > /dev/null

My results:

timing Haskell

real    0m0.980s
user    0m0.296s
sys     0m0.684s


timing 'cat'

real    0m0.304s
user    0m0.001s
sys     0m0.302s

The profiling report when compiling with -prof and running with +RTS -p is below:

  Sat Dec 13 21:26 2014 Time and Allocation Profiling Report  (Final)

     Main +RTS -p -RTS huge

  total time  =        0.92 secs   (922 ticks @ 1000 us, 1 processor)
  total alloc = 7,258,596,176 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0  100.0


                                                       individual     inherited
COST CENTRE MODULE                   no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                      46           0  100.0  100.0   100.0  100.0
 CAF        GHC.Conc.Signal           84           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.FD                 82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Handle.FD          81           0    0.0    0.0     0.0    0.0
 CAF        System.Posix.Internals    76           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding           70           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv     69           0    0.0    0.0     0.0    0.0
imz -- Ivan Zakharyaschev
statusfailed
  • 2
    It's good to see that you have also added -O2 optimization. You can also post the link to the profiled report here. – Sibi Dec 13 '14 at 21:13
  • Sibi: I've added it, but unfortunately it seems a bit opaque. The only thing that seems odd is perhaps the "total alloc" number being very large, when I would expect a bytestring buffer's memory to be overwritten as an optimisation (I'm not sure if this is possible). – statusfailed Dec 13 '14 at 21:20
  • 4
    Instead of `hGetSome` use `hGetNonBlocking`; it increases performance substantially – Sibi Dec 13 '14 at 21:25
  • Sibi: You are right, that is significantly faster- but still about 2x slower than "cat". Is it possible to get faster? – statusfailed Dec 13 '14 at 21:30
  • 2
    @statusfailed Are you using the same encoding for both `cat` and your test program? If you use `LANG=C ./Main huge` does anything change? – Daniel Wagner Dec 13 '14 at 21:33
  • 1
    @DanielWagner I have `en_US.utf8`, but running `LANG=C ./Main huge` has no effect on the run time. I'm under the impression that ByteString totally ignores encoding (and just reads strings of bytes) – statusfailed Dec 13 '14 at 21:38
  • @statusfailed Yes, of course you're right. I saw `GHC.IO.Encoding` in the profiling output and all these warning bells went off in my head, but I didn't stop to think about them carefully. In my defense: it's hard to think straight with all these bells ringing! – Daniel Wagner Dec 13 '14 at 21:44
  • 6
    Notice caching impacts the timing significantly. If I run `cat` first in the shell script "benchmark" instead of second it performs about 60% slower on my system. – Thomas M. DuBuisson Dec 13 '14 at 21:59
  • @ThomasM.DuBuisson Yeah, that brings C and the Haskell version pretty close, but still there is some difference ( 1 ~ 1.5%) – Sibi Dec 13 '14 at 22:11
  • 3
    `-prof` won’t be terribly useful if you don’t have any cost centers; try compiling it (and the whole `bytestring` package, too, if you can (probably in a Cabal sandbox)) with `-fprof-auto` and maybe `-fprof-cafs` in addition to `-prof`. – icktoofay Dec 13 '14 at 22:36
  • @icktoofay This is very helpful, but unfortunately just highlights that everything is happening in the "go" function :-) – statusfailed Dec 13 '14 at 22:50
  • @ThomasM.DuBuisson I still see a difference of 1.2x - 1.3x, but that does seem to help a little. Interestingly, if I directly pipe "yes | head" into Main, it seems to run about 100ms _faster_ than `cat`- regardless of the order I run them in. – statusfailed Dec 13 '14 at 23:08
  • 2
    @statusfailed Changing `bufsize` to `1024*1024` also gives a slight boost in my system. – Sibi Dec 13 '14 at 23:10
  • 4
    @statusfailed A [pipes solution](http://lpaste.net/116401) runs pretty close. I have also attached benchmark with the code. – Sibi Dec 13 '14 at 23:19
  • Note that I have changed the benchmark to be more fair: http://pastebin.com/r7i19SSs – Sibi Dec 13 '14 at 23:20
  • 1
    @Sibi I made a variant of `bytestring-mmap` that uses `Producer ByteString` instead of `Lazy.ByteString`, which might amuse. https://github.com/michaelt/pipes-bytestring-mmap – Michael Dec 15 '14 at 17:57
  • @Arthur Wow, that is amazing! Can you post that as an answer, please. :) – Sibi Dec 16 '14 at 07:09
  • @Sibi Yeah, I did, for the heck of it. Of course, I'm still trying to figure out what the limits and worthwhile uses of an mmapped bytestring producer are... – Michael Dec 16 '14 at 19:10
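The lpaste link in Sibi's comment above is no longer reachable, so for reference, a minimal sketch of what a Pipes.ByteString version along those lines might look like (assuming the pipes and pipes-bytestring packages; this is not necessarily the exact code Sibi benchmarked):

```haskell
module Main where

import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), withFile)
import Pipes (runEffect, (>->))
import qualified Pipes.ByteString as PB

-- Stream the file to stdout chunk by chunk; pipes supplies the loop,
-- so no explicit recursion over buffers is needed.
main :: IO ()
main = do
  file <- fmap head getArgs
  withFile file ReadMode $ \h ->
    runEffect $ PB.fromHandle h >-> PB.stdout
```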

3 Answers


This is only a partial answer trying to address the second question:

I tried something like this using the GHC.IO.Buffer API:

module Main where

import System.IO
import System.Environment
import GHC.IO.Buffer
import Data.ByteString as BS

import Control.Monad

-- Copied from cat source code
bufsize = 1024*128

go handle bufPtr = do
  read <- hGetBuf handle bufPtr bufsize
  when (read > 0) $ do
    hPutBuf stdout bufPtr read
    go handle bufPtr

main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- newByteBuffer bufsize WriteBuffer

  withBuffer buf $ go handle

and it seems to come closer to the performance of 'cat', but still definitely slower...

time ./Cat huge > /dev/null 
./Cat huge > /dev/null  0.00s user 0.06s system 76% cpu 0.081 total

time cat huge > /dev/null  
cat huge > /dev/null  0.00s user 0.05s system 75% cpu 0.063 total

I think that by using the buffer API we can explicitly avoid allocating a fresh bytestring for every chunk, as hGetSome does in the original code, but I am just guessing here and don't know what exactly is happening in either compiled program...
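As a cross-check of that allocation guess, the same pattern can be written without GHC.IO.Buffer at all, reusing one raw buffer from Foreign.Marshal.Alloc (a sketch under the same bufsize assumption, not benchmarked here):

```haskell
module Main where

import System.Environment (getArgs)
import System.IO
import Foreign.Marshal.Alloc (allocaBytes)

bufsize :: Int
bufsize = 1024 * 128

main :: IO ()
main = do
  file   <- fmap head getArgs
  handle <- openFile file ReadMode
  -- One buffer, allocated once and reused for every chunk, so no
  -- per-chunk bytestring allocation should show up in the profile.
  allocaBytes bufsize $ \ptr ->
    let loop = do
          n <- hGetBuf handle ptr bufsize
          if n > 0
            then hPutBuf stdout ptr n >> loop
            else return ()
    in loop
```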

UPDATE: Adding the original code's performance on my laptop:

time ./Cat2 huge > /dev/null
./Cat2 huge > /dev/null  0.12s user 0.10s system 99% cpu 0.219 total

UPDATE 2: Adding some basic profiling results:

Original Code:

Cat2 +RTS -p -RTS huge

    total time  =        0.21 secs   (211 ticks @ 1000 us, 1 processor)
    total alloc = 6,954,068,112 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0  100.0


                                                       individual     inherited
COST CENTRE MODULE                   no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                      46           0  100.0  100.0   100.0  100.0
 CAF        GHC.IO.Handle.FD          86           0    0.0    0.0     0.0    0.0
 CAF        GHC.Conc.Signal           82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding           80           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.FD                 79           0    0.0    0.0     0.0    0.0
 CAF        System.Posix.Internals    75           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv     72           0    0.0    0.0     0.0    0.0

Buffer-API Code:

Cat +RTS -p -RTS huge

    total time  =        0.06 secs   (61 ticks @ 1000 us, 1 processor)
    total alloc =   3,487,712 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0   98.9


                                                      individual     inherited
COST CENTRE MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                     44           0  100.0   98.9   100.0  100.0
 CAF        GHC.IO.Handle.FD         85           0    0.0    1.0     0.0    1.0
 CAF        GHC.Conc.Signal          82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding          80           0    0.0    0.1     0.0    0.1
 CAF        GHC.IO.FD                79           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv    71           0    0.0    0.0     0.0    0.0

Notice especially the big difference in allocation costs...

bmk
  • 6
    Note that there is also a fixed overhead of approximately 10 ms for every Haskell program, so if you subtract that from your Haskell timing it is even closer to `cat`. – Gabriella Gonzalez Dec 13 '14 at 22:35
  • 1
    I know I didn't specify this in the question, but I'm hoping to go for a more idiomatic / "high level" Haskell solution. Upvoted for a good answer though. – statusfailed Dec 13 '14 at 22:58

The original question made me think it was about finding a performance issue in the exact code provided. Since the comment "I'm hoping to go for a more idiomatic / 'high level' Haskell solution" contradicts that assumption, I'll give a reasonably performing, idiomatic Haskell solution instead.

The way I would expect any random programmer familiar with Haskell to solve this problem is with lazy ByteStrings. This lets the programmer simply specify the task of reading input and writing output, while the library worries about the buffering and looping constructs.

module Main where

import System.IO
import System.Environment
import Data.ByteString.Lazy as BS

import Control.Monad

main :: IO ()
main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- BS.hGetContents handle
  hPut stdout buf

The result is both more readable and better performing than the code in the original question:

timing 'cat'

real    0m0.075s
user    0m0.000s
sys     0m0.074s
timing strict bytestring with GHC -O2

real    0m0.254s
user    0m0.126s
sys     0m0.127s
timing strict bytestring with GHC -O2 -fllvm

real    0m0.267s
user    0m0.132s
sys     0m0.134s
timing lazy bytestring with GHC -O2

real    0m0.091s
user    0m0.023s
sys     0m0.067s
timing lazy bytestring with GHC -O2 -fllvm

real    0m0.091s
user    0m0.021s
sys     0m0.069s

That is, the lazy bytestring solution is 21% slower than cat. Putting cat last, for preferential caching behavior, results in a 59 ms cat runtime, placing the Haskell solution at 51% slower.
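For what it's worth, since lazy BSL.readFile is just openFile plus hGetContents, the program above condenses further (an equivalent sketch, not separately benchmarked):

```haskell
module Main where

import System.Environment (getArgs)
import System.IO (stdout)
import qualified Data.ByteString.Lazy as BSL

-- readFile is lazy: chunks are only read as hPut consumes them,
-- so memory use stays bounded at roughly one chunk at a time.
main :: IO ()
main = do
  file <- fmap head getArgs
  BSL.readFile file >>= BSL.hPut stdout
```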

EDIT: Dons suggested that using memory-mapped IO would more accurately model cat's behavior. I'm not sure how accurate that statement is, but mmap almost always results in better performance, and this situation is certainly no exception:

timing memory mapped lazy bytestring with GHC -O2

real    0m0.008s
user    0m0.004s
sys     0m0.003s

Which was produced by:

module Main where

import System.IO (stdout)
import System.Environment
import System.IO.Posix.MMap.Lazy
import Data.ByteString.Lazy (hPut)

import Control.Monad

main :: IO ()
main = do
  file    <- fmap Prelude.head getArgs
  buf     <- unsafeMMapFile file
  hPut stdout buf
Thomas M. DuBuisson
  • 4
    mmap-bytestring should be closer to `cat` in behavior, using the OS to do the lazy allocation. – Don Stewart Dec 15 '14 at 08:55
  • @DonStewart mmap is always a great thing to remember wrt performance but that's a way to blow cat out of the water wrt performance. I'm not sure my gnu cat uses mmap... indeed strace shows huge numbers of reads and writes. – Thomas M. DuBuisson Dec 15 '14 at 12:10
  • @DonStewart I actually tried bytestring-mmap, and it was pretty much on par with cat. Good advice! Again though, it makes me uncomfortable because it feels non-idiomatic (although Thomas M. DuBuisson is quite right to call me out on my self-contradiction) – statusfailed Dec 15 '14 at 23:09
  • @ThomasM.DuBuisson Re: self contradiction- what I'd hoped is to more fully explore a space of solutions, so I could pick one that satisfies my notion of "idiomatic", but thanks for pointing this out :-) – statusfailed Dec 15 '14 at 23:10

Remark post festum:

I'm not sure what the question is now that people have booted it around a bit. I wanted to see what was up with bytestring-mmap, so I made a pipes version to 'correct' its lazy bytestring module: https://github.com/michaelt/pipes-bytestring-mmap In consequence I assembled all of these programs, using Sibi's test method. The only two of the modules in https://github.com/michaelt/pipes-bytestring-mmap/tree/master/bench that seem like anything but dumb bread-and-butter Haskell are the ones that use fancy explicit buffer management.

Anyway, here are some results. File size grows by 10× as we move to the right. It is interesting to see how far the programs differ at different file sizes. The programs that don't use mmap only start to show their character as 'linear in the length of the file' at 420M; at that point and after, they are all almost exactly the same, suggesting that the rather divergent behavior at smaller sizes can't be taken too seriously. The mmap programs all behave similarly to each other, with a few curiosities (which I replicated). All of this is on OS X.

4200000           42000000          420000000         4200000000

timing 'cat'

real  0m0.006s    real  0m0.013s    real  0m0.919s    real  0m8.154s
user  0m0.002s    user  0m0.002s    user  0m0.005s    user  0m0.028s
sys   0m0.003s    sys   0m0.009s    sys   0m0.223s    sys   0m2.179s


timing lazy bytestring - idiomatic Haskell (following Thomas M. DuBuisson) 

real  0m0.009s    real  0m0.025s    real  0m0.894s    real  0m9.146s
user  0m0.002s    user  0m0.006s    user  0m0.078s    user  0m0.787s
sys   0m0.005s    sys   0m0.016s    sys   0m0.288s    sys   0m3.001s


timing fancy buffering following statusfailed

real  0m0.014s    real  0m0.066s    real  0m0.876s    real  0m8.686s
user  0m0.005s    user  0m0.028s    user  0m0.278s    user  0m2.724s
sys   0m0.007s    sys   0m0.035s    sys   0m0.424s    sys   0m4.232s


timing fancier use of GHC.Buf following bmk

real  0m0.011s    real  0m0.018s    real  0m0.831s    real  0m8.218s
user  0m0.002s    user  0m0.003s    user  0m0.034s    user  0m0.289s
sys   0m0.006s    sys   0m0.013s    sys   0m0.236s    sys   0m2.447s


timing Pipes.ByteString following sibi

real  0m0.012s    real  0m0.020s    real  0m0.845s    real  0m8.241s
user  0m0.003s    user  0m0.004s    user  0m0.020s    user  0m0.175s
sys   0m0.007s    sys   0m0.014s    sys   0m0.239s    sys   0m2.509s

Then with mmap

timing Lazy.MMap following dons and Thomas M. DuBuisson 

real  0m0.006s    real  0m0.006s    real  0m0.037s    real  0m0.133s
user  0m0.002s    user  0m0.002s    user  0m0.006s    user  0m0.051s
sys   0m0.003s    sys   0m0.003s    sys   0m0.013s    sys   0m0.061s

timing Pipes.ByteString.MMap with SafeT machinery

real  0m0.006s    real  0m0.010s    real  0m0.051s    real  0m0.196s
user  0m0.002s    user  0m0.004s    user  0m0.012s    user  0m0.099s
sys   0m0.003s    sys   0m0.005s    sys   0m0.016s    sys   0m0.072s


timing Pipes.ByteString.MMap 'withFile' style

real  0m0.008s    real  0m0.008s    real  0m0.142s    real  0m0.134s
user  0m0.002s    user  0m0.002s    user  0m0.007s    user  0m0.046s
sys   0m0.004s    sys   0m0.004s    sys   0m0.016s    sys   0m0.066s
Michael