Spoiler: Yes. See below.
Trying to optimize a letter counter to match C. I've fought it to a 2x deficit.
letterCount :: B.ByteString -> V.Vector Int
letterCount bs =
V.accumulate
(\a _ -> a + 1)
(V.replicate 256 0)
letters1
where
len = B.length bs
letters1 = V.generate len (\i -> (fromIntegral $! B.index bs i, ()))
Some notes:
- It was really slow until I changed
Data.Vector
toData.Vector.Unboxed
. Why is that? - I thought most of the time would be spent in
accumulate
. I was wrong. - 70% of the time is spent in
generate
. - Haskell code suffers from having to pointlessly convert Word8 to Int; Also, a useless army of
()
may or may not actually be created.
Full listing:
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
import System.Environment
import Text.Printf
letterCount :: B.ByteString -> V.Vector Int
letterCount bs =
V.accumulate
(\a _ -> a + 1)
(V.replicate 256 0)
letters1
where
len = B.length bs
letters1 = V.generate len (\i -> (fromIntegral $! B.index bs i, ()))
printCounts :: V.Vector Int -> IO ()
printCounts cs =
mapM_
(uncurry $ printf "%c: %d\n")
(zip (map toEnum [0..255] :: String) (V.toList cs))
main :: IO ()
main = do
filename <- fmap head getArgs
f <- B.readFile filename
let counts = letterCount f
printCounts counts
Competing C code:
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <stdlib.h>
int letcnt [256];
int* letter_count(unsigned char *s, unsigned int len)
{
int i;
memset(letcnt, 0, 256 * sizeof(int));
for(i = 0; i < len; i++){
letcnt[*(s + i)]++;
}
return (letcnt);
}
void print_counts() {
int i;
for(i = 0; i < 256; i++) {
printf("'%c': %d\n", (unsigned char) i, letcnt[i]);
}
}
// st_size
int main(int argc, char **argv)
{
assert(argc == 2);
FILE* f = fopen(argv[1], "r");
struct stat st;
stat(argv[1], &st);
off_t len = st.st_size;
unsigned char* contents = calloc(len, 1);
fread(contents, len, 1, f);
fclose(f);
letter_count(contents, len);
print_counts();
return 0;
}
Timings;
$ time ./a.out /usr/share/dict/words > /dev/null
real 0m0.012s
user 0m0.005s
sys 0m0.005s
$ time ./lettercount /usr/share/dict/words > /dev/null
real 0m0.017s
user 0m0.009s
sys 0m0.007s
Update
I think the performance ceiling is down to this bug: runST isn't free. Not that I believe it's impossible to optimize further but unlikely to approach C so long as runST imposes some overhead.
Also, fixed C-code based on @Zeta's comment.