
Spoiler: Yes. See below.

I'm trying to optimize a Haskell letter counter to match an equivalent C program. So far I've fought it down to a 2x deficit.

letterCount :: B.ByteString -> V.Vector Int
letterCount bs =
    V.accumulate
        (\a _ -> a + 1)
        (V.replicate 256 0)
        letters1
  where
    len = B.length bs
    letters1 = V.generate len (\i -> (fromIntegral $! B.index bs i, ()))

Some notes:

  1. It was really slow until I changed Data.Vector to Data.Vector.Unboxed. Why is that?
  2. I thought most of the time would be spent in accumulate. I was wrong.
  3. 70% of the time is spent in generate.
  4. The Haskell code suffers from having to pointlessly convert Word8 to Int. Also, a useless army of () values may or may not actually be created.

Full listing:

import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
import System.Environment
import Text.Printf

letterCount :: B.ByteString -> V.Vector Int
letterCount bs =
    V.accumulate
        (\a _ -> a + 1)
        (V.replicate 256 0)
        letters1
  where
    len = B.length bs
    letters1 = V.generate len (\i -> (fromIntegral $! B.index bs i, ()))

printCounts :: V.Vector Int -> IO ()
printCounts cs =
    mapM_
        (uncurry $ printf "%c: %d\n")
        (zip (map toEnum [0..255] :: String) (V.toList cs))

main :: IO ()
main = do
    filename <- fmap head getArgs
    f <- B.readFile filename
    let counts = letterCount f
    printCounts counts

Competing C code:

    #include <assert.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <stdlib.h>

    int letcnt [256];

    int* letter_count(unsigned char *s, unsigned int len)
    {
        int i;
        memset(letcnt, 0, 256 * sizeof(int));
        for(i = 0; i < len; i++){
            letcnt[*(s + i)]++;
        }
        return (letcnt);
    }

    void print_counts() {
        int i;
        for(i = 0; i < 256; i++) {
            printf("'%c': %d\n", (unsigned char) i, letcnt[i]);
        }
    }
    // st_size
    int main(int argc, char **argv)
    {
        assert(argc == 2);
        FILE* f = fopen(argv[1], "r");
        struct stat st;
        stat(argv[1], &st);
        off_t len = st.st_size;
        unsigned char* contents = calloc(len, 1);
        fread(contents, len, 1, f);
        fclose(f);
        letter_count(contents, len);
        print_counts();
        return 0;
    }

Timings:

$ time ./a.out /usr/share/dict/words > /dev/null

real  0m0.012s
user  0m0.005s
sys   0m0.005s

$ time ./lettercount /usr/share/dict/words > /dev/null

real  0m0.017s
user  0m0.009s
sys   0m0.007s

Update

I think the performance ceiling comes down to this bug: runST isn't free. I don't believe further optimization is impossible, but the Haskell version is unlikely to approach C as long as runST imposes some overhead.

Also, I fixed the C code based on @Zeta's comment.

Michael Fox
  • On my (slow) MacBook Air, when I compile them with full optimization (`-O3` for C, `-O2` for Haskell) they are comparable: 0.016 for C, 0.018 for Haskell. Not a 2x deficit. – Tomo Dec 02 '14 at 12:43
  • Shouldn't this go on CodeReview.SE? – Sebastian Redl Dec 02 '14 at 12:53
  • @SebastianRedl Probably. I had not heard of it. Might be my new favorite thing. – Michael Fox Dec 02 '14 at 12:56
  • @Tomo I'm looking at `user` time also on a slow macbook air and taking the best of 3 runs. 9ms vs 5ms consistently. Looking at `real` time may close the gap since both have the same IO overhead, hopefully. – Michael Fox Dec 02 '14 at 12:59
  • IMHO, "Could you do better" is either a [code review](http://codereview.stackexchange.com) or a [programming puzzle](http://codegolf.stackexchange.com). – Zeta Dec 02 '14 at 13:03
  • @Zeta Programming puzzles? So many new favorite things. – Michael Fox Dec 02 '14 at 13:05
  • Those times are too small for any sensible comparison. E.g., the Haskell RTS is guaranteed to have a startup time that is a lot longer than C's. – augustss Dec 02 '14 at 13:26
  • Also, your C program has undefined behaviour (`char` is `signed`, at least on my system). That being said, both programs finish in `0.00s` on my system. – Zeta Dec 02 '14 at 13:28
  • You might want to get to know `-ddump-simpl`, perhaps with `-dsuppress-all -dno-suppress-type-signatures`, and also `-ddump-asm`. Then you can stop guessing so much. – dfeuer Dec 02 '14 at 13:33
  • Use a larger dataset: 0.01s benchmarks are too short to be meaningful. – chi Dec 02 '14 at 15:08
  • @Zeta, et al. Yes. I tested it with much larger files on my system and saw more divergence. For example, 538 versus 186 ms on a 150M file. – Michael Fox Dec 02 '14 at 21:16
  • @AndrewC Well, this isn't a real application or anything. I started off wanting to learn how to use Data.Vector and then see what more I can learn in the process. But sure, except learning today has been a loss. – Michael Fox Dec 03 '14 at 00:36
  • I shouldn't have commented after the difficult day I'd had. Apologies. – AndrewC Dec 03 '14 at 08:06
  • Currently this question boils down to whether you are better at writing Haskell or C code; either could be slower depending on how you write and compile your code. I'm not voting to close the question at the moment because it feels like you have a good question fighting to get out and you've attracted some useful answers. Perhaps something along the lines of 'What are the limiting factors to Haskell performance when counting letters and how can they be overcome?' would be closer to an answerable question and still fit in with the answers you've generated. – forsvarir Dec 08 '14 at 13:49

3 Answers


Point 1. A boxed vector is an array of pointers to possibly-unevaluated expressions that produce an Int. An unboxed vector is just an array of integers. It's definitely strict, it means far less memory allocation / garbage collection, and it probably has better CPU cache behavior. This is why the unboxed version is offered in the first place!

Point 4. My understanding is that conversions between integer types are no-ops at run-time. Basically Int and Word8 are stored identically; the only difference is how (+) and similar operations are implemented.

Also, it is my understanding that nullary constructors such as () (and also True, False, Nothing, ...) are shared among all instances. So you are not "creating" an army of () values.
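
To see Point 1 for yourself, a minimal sketch along these lines runs the question's counter against both vector flavours. The names `countBoxed` and `countUnboxed` are made up for illustration; the only difference between the two functions is which module the vector operations come from.

```haskell
import qualified Data.ByteString as B
import qualified Data.Vector as Boxed
import qualified Data.Vector.Unboxed as Unboxed

-- Boxed: every slot of the result is a pointer to a heap value,
-- which may still be an unevaluated thunk.
countBoxed :: B.ByteString -> Boxed.Vector Int
countBoxed bs =
    Boxed.accumulate (\a _ -> a + 1) (Boxed.replicate 256 0) pairs
  where
    pairs = Boxed.generate (B.length bs)
                (\i -> (fromIntegral (B.index bs i), ()))

-- Unboxed: the result is a flat array of machine integers, always evaluated.
countUnboxed :: B.ByteString -> Unboxed.Vector Int
countUnboxed bs =
    Unboxed.accumulate (\a _ -> a + 1) (Unboxed.replicate 256 0) pairs
  where
    pairs = Unboxed.generate (B.length bs)
                (\i -> (fromIntegral (B.index bs i), ()))
```

Both functions compute the same counts, so timing them on the same large input isolates the cost of boxing.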

MathematicalOrchid

It's a bit faster if you remove the bounds checking:

import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as B
import qualified Data.Vector.Unboxed as V
import System.Environment
import Text.Printf

letterCount :: B.ByteString -> V.Vector Int
letterCount bs =
    V.unsafeAccumulate
        (\a _ -> a + 1)
        (V.replicate 256 0)
        letters1
  where
    len = B.length bs
    letters1 = V.generate len (\i -> (fromIntegral $! B.unsafeIndex bs i, ()))

printCounts :: V.Vector Int -> IO ()
printCounts cs =
    mapM_
        (uncurry $ printf "%c: %d\n")
        (zip (map toEnum [0..255] :: String) (V.toList cs))

main :: IO ()
main = do
    filename <- fmap head getArgs
    f <- B.readFile filename
    let counts = letterCount f
    printCounts counts

However, the runtimes vary too much (both for the C and Haskell versions) because the input size is too small.
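
For comparison, here is a hedged sketch that goes one step further than `unsafeAccumulate`: it skips the intermediate vector of `(index, ())` pairs and bumps the counters in place through a mutable unboxed vector. This is a swapped-in technique, not the code from the question, and `letterCountST` is a name invented here. No timings are claimed for it.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as BU
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

-- Count byte occurrences by mutating a 256-slot vector inside ST.
-- V.create runs the ST action and freezes the result without copying.
letterCountST :: B.ByteString -> V.Vector Int
letterCountST bs = V.create $ do
    counts <- MV.replicate 256 0
    let len = B.length bs
        go i
          | i >= len  = return ()
          | otherwise = do
              -- A Word8 is always in 0..255, so the unchecked
              -- read and write below stay inside the 256 slots.
              let k = fromIntegral (BU.unsafeIndex bs i)
              n <- MV.unsafeRead counts k
              MV.unsafeWrite counts k (n + 1)
              go (i + 1)
    go 0
    return counts
```

The function has the same type as `letterCount`, so it can be dropped into `main` unchanged.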

tibbe

Yes. If you compile with -fllvm, Haskell matches C in user time. The big surprise is that if you switch to lazy ByteStrings, the Haskell version beats the C version on real time by a small but significant margin.

import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Vector.Unboxed as V
import System.Environment
import Text.Printf

letterCount :: B.ByteString -> V.Vector Int
letterCount bs =
    V.unsafeAccumulate
        (\a _ -> a + 1)
        (V.replicate 256 0)
        (parse bs)

parse :: B.ByteString -> V.Vector (Int, ())
parse = V.unfoldr step
  where
    step s = if B.null s
        then Nothing
        else Just ((fromIntegral . fromEnum $ B.head s, ()), B.tail s)
{-# INLINE parse #-}

printCounts :: V.Vector Int -> IO ()
printCounts cs =
    mapM_
        (uncurry $ printf "%c: %d\n")
        (zip (map toEnum [0..255] :: String) (V.toList cs))

main :: IO ()
main = do
    filename <- fmap head getArgs
    f <- B.readFile filename
    let counts = letterCount f
    printCounts counts

Remember to compile like:

ghc -O2 -fllvm letterCount.hs

So, Vector + ByteString.Lazy + LLVM > C. I love it!
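
One plausible explanation, offered as an assumption rather than something measured here, is that a lazy ByteString is a list of strict chunks, so the file is consumed piece by piece instead of being loaded into one large buffer first. A small sketch (the helper name `chunkSizes` is made up for illustration) makes that chunk structure visible:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL

-- Sizes of the strict chunks that make up a lazy ByteString, in order.
chunkSizes :: BL.ByteString -> [Int]
chunkSizes = map BS.length . BL.toChunks
```

Applied to the result of the lazy `readFile`, this lists the size of each chunk the file was split into.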

Update

In fairness to C, I updated the C code to use a single fixed-size buffer, which avoids a big up-front allocation (or any heap allocation at all) and should be more cache friendly. Now the Haskell and C versions show no significant difference: both run in about 190 ms, best of 3, against a large 150M input file:

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <stdlib.h>

#define CHUNK 16384

int letcnt [256];

int* letter_count(unsigned char *s, unsigned int len)
{
  int i;
  for(i = 0; i < len; i++){
    letcnt[*(s + i)]++;
  }
  return (letcnt);
}

int* letter_count_chunks(unsigned int len, FILE* f)
{
  int i;
  unsigned char chunk [CHUNK];
  memset(letcnt, 0, sizeof(letcnt));
  for(i = 0; i < len - CHUNK; i+= CHUNK) {
    fread(chunk, CHUNK, 1, f);
    letter_count(chunk, CHUNK);
  }
  fread(chunk, len - i, 1, f);
  letter_count(chunk, len - i);

  return letcnt;
}

void print_counts() {
  int i;
  for(i = 0; i < 256; i++) {
    printf("'%c': %d\n", (unsigned char) i, letcnt[i]);
  }
}
// st_size
int main(int argc, char **argv)
{
  assert(argc == 2);
  FILE* f = fopen(argv[1], "r");
  struct stat st;
  stat(argv[1], &st);
  off_t len = st.st_size;
  letter_count_chunks(len, f);
  fclose(f);
  print_counts();
  return 0;
}
Michael Fox
  • Your C code fails if the file is < CHUNK bytes (change len to int in the declaration of letter_count_chunks). `letter_count` and `letter_count_chunks` should probably be void since you don't use the return value. Depending on the compiler you may also get better performance using pointers directly in your letter_count function, something like `for(unsigned char *e = s+len;s – forsvarir Dec 08 '14 at 13:55