
I found that the following Haskell code uses 100% CPU and takes about 14 seconds to finish on my Linux server.

{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Lazy.Char8 as L
import System.IO

str = L.pack "FugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFuga\n"

main = do
  hSetBuffering stdout (BlockBuffering (Just 1000))
  sequence (take 1000000 (repeat (L.hPutStr stdout str >> hFlush stdout)))
  return ()

On the other hand, very similar Python code finishes the same task in about 3 seconds.

import sys

str = "FugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFuga\n"

def main():
    for i in xrange(0, 1000000):
        print str,
        sys.stdout.flush()
        # doIO()

main()

Using strace, I found that select is called every time hFlush is called in the Haskell version, whereas select is not called at all in the Python version. I guess this is one of the reasons the Haskell version is slow.

Is there any way to improve the performance of the Haskell version?

I already tried omitting hFlush, and it certainly reduced CPU usage a lot, but that is not a satisfactory solution because the output is no longer flushed.

Thanks.

EDITED

Thank you very much for your help! By changing sequence and repeat to replicateM_, the runtime dropped from 14s to 3.8s.
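
For reference, the rewritten loop looks roughly like this (a sketch of just that change; it keeps the original lazy ByteString and the per-write hFlush):

{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Lazy.Char8 as L
import Control.Monad (replicateM_)
import System.IO

str = L.pack "FugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFuga\n"

main = do
  hSetBuffering stdout (BlockBuffering (Just 1000))
  -- replicateM_ discards the results, so no million-element list of () is built
  replicateM_ 1000000 (L.hPutStr stdout str >> hFlush stdout)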

But now I have another question. I asked the above question because when I removed hFlush from the program, it ran fast even though it still repeated the I/O using sequence and repeat.

Why does only the combination of sequence and hFlush make it slow?

To investigate this, I changed my program as follows and profiled it.

{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Char8 as S
import System.IO
import Control.Monad

str = S.pack "FugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFuga\n"

doIO = S.hPutStr stdout str >> hFlush stdout
doIO' = S.hPutStr stdout str >> hFlush stdout
doIOWithoutFlush = S.hPutStr stdout str

main = do
  hSetBuffering stdout (BlockBuffering (Just 1000))
  sequence (take 1000000 (repeat doIO))
  replicateM_ 1000000 doIO'
  sequence (take 1000000 (repeat doIOWithoutFlush))
  return ()

By compiling and running as follows:

$ ghc -O2 -prof -fprof-auto Fuga.hs
$ ./Fuga +RTS -p -RTS > /dev/null

I got the following result.

COST CENTRE      MODULE  %time %alloc

doIO             Main     74.7   35.8
doIO'            Main     21.4   35.8
doIOWithoutFlush Main      2.6   21.4
main             Main      1.3    6.9

What makes the difference between doIO and doIO', which do the same task? And why does doIOWithoutFlush run fast even under sequence and repeat? Is there any reference about this behavior?

Thanks.

beketa
  • The use of lazy bytestrings here seems unnecessary. – Don Stewart Nov 13 '12 at 15:14
  • Lazy ByteStrings are indeed unnecessary and not a fair comparison to the Python version, but that doesn't make much of a difference here -- it's one chunk, after all, so it's only a little more work per iteration. The biggest difference comes from using the `M_` functions. – shachaf Nov 13 '12 at 15:21

3 Answers


Calling hFlush on every write seems wrong.

This simple change, using strict bytestrings, replicateM_ (or forM_) instead of your explicit sequence, and block buffering, reduces the runtime from 16.2s to 0.3s:

{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Char8 as S
import Control.Monad
import System.IO

str = S.pack "FugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFuga\n"

main = replicateM_ 1000000 $ S.putStr str

A more idiomatic approach, though, would be a single write of a lazy bytestring, relying on the bytestring library to coordinate the writes:

import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L
import Control.Monad
import System.IO

str :: S.ByteString
str = S.pack "FugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFugaFuga\n"

main = L.putStr $ L.fromChunks (replicate 1000000 str)

This gives marginally better performance (0.27s).

Don Stewart
  • Thanks. Actually I am developing a network client. It needs to flush after sending each command line. That is why I included hFlush in the above program. While profiling my program, I found that the CPU usage of a function which calls hFlush was high, and while investigating that issue I found the above problem. As for the above problem, your solution is perfect. – beketa Nov 14 '12 at 01:32

I'm not sure about the Python code (what's doIO()?), but an obvious way to improve the Haskell version is to use sequence_ instead of sequence, so it doesn't need to build up the huge list of ()s. That small change makes it 6-7 times faster on my machine.

(A simpler way of expressing that line would be replicateM_ 1000000 (L.hPutStr stdout str >> hFlush stdout).)

It might be that the number of system calls is significant -- GHC's RTS does do non-blocking I/O, and possibly makes unnecessary select calls -- but going by your numbers, this change might be enough to bring it into the Python range on its own.
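
For reference, the list-specialised types show where the difference comes from (a sketch; the actual signatures in the Prelude and Control.Monad are a bit more general):

sequence    :: Monad m => [m a] -> m [a]      -- runs the actions and keeps every result in a list
sequence_   :: Monad m => [m a] -> m ()       -- runs the actions and discards the results
replicateM_ :: Monad m => Int -> m a -> m ()  -- repeats one action, discarding the results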

shachaf

The big problem is that

sequence (take 1000000 (repeat (L.hPutStr stdout str >> hFlush stdout)))

collects the results of the IO actions in a list. If you instead discard the results,

sequence_ (take 1000000 (repeat (L.hPutStr stdout str >> hFlush stdout)))

it'll be much faster and do much less allocation.

Daniel Fischer