I am benchmarking Haskell's array libraries (the array
and vector
packages) to come up with the best way of storing large data for my use case. I am using criterion
as the benchmarking tool.
Long story short: my code simply allocates a vector and proceeds to fill it with simple structs (1M, 10M, and 100M elements, respectively). When I compare the Haskell benchmark times with a simple reference implementation I wrote in C, Haskell is a few times faster and I find it suspicious: the C code is a simple loop filling the structs in the array.
The question: is it possible for Haskell's vector
library to beat C in terms of performance? Or does it mean my benchmarks are flawed/something is not actually evaluated/there's some 'gotcha'?
Another question how to make sure that the Haskell vectors are actually evaluated?
Longer explanation: The task at hand is to fill a vector with a large number of structs. They have Storable
instances and the vector used is Data.Vector.Storable
.
The data type is the following:
data Foo = Foo Int Int deriving (Show, Eq, Generic, NFData)
And the Storable
instances look like this:
chunkSize :: Int
chunkSize = sizeOf (undefined :: Int)
{-# INLINE chunkSize #-}
instance Storable Foo where
sizeOf _ = 2 * chunkSize ; {-# INLINE sizeOf #-}
alignment _ = chunkSize ; {-# INLINE alignment #-}
peek ptr = Foo
<$> peekByteOff ptr 0
<*> peekByteOff ptr chunkSize
{-# INLINE peek #-}
poke ptr (Foo a b) = do
pokeByteOff ptr 0 a
pokeByteOff ptr chunkSize b
{-# INLINE poke #-}
The serialization itself seems to work fine. The vector is then allocated:
mkFooVec :: Int -> IO (Vector Foo)
mkFooVec !i = unsafeFreeze =<< new (i + 1)
And populated with the structs:
populateFooVec :: Int -> Vector Foo -> IO (Vector Foo)
populateFooVec !i !v = do
v' <- unsafeThaw v
let go 0 = return ()
go j = unsafeWrite v' j (Foo j $ j + 1) >> go (j - 1)
go i
unsafeFreeze v'
Benchmark is the standard criterion one:
defaultMain [
bgroup "Storable vector (mutable)"
$ (\(i :: Int) -> env (mkFooVec (10 ^ i))
$ \v -> bench ("10e" <> show i)
$ nfIO (populateFooVec (10 ^ i) v)) <$> [6..8]
]
The gist contains other benchmarks, trying to force evaluation in different ways.
Reference C code doing more or less the same can be found here (gist). The main logic is the following:
Foo *allocFoos(long n) {
return (Foo *) malloc(n * sizeof(Foo));
}
// populate the array with structs:
void createFoos(Foo *v, long n) {
for (long i = 0; i < n; ++i) {
v[i].name = i;
v[i].id = i + 1;
}
}
And the command used to run it: gcc -O2 -o bench benchmark.c && ./bench
Now when I run the benchmarks, the C code takes about 50ms, while Criterion reports results around 800 picoseconds (!). This makes me wonder: maybe I'm interpreting the results wrong? Maybe the vector isn't actually evaluated (although if you look at the Haskell gist, I try to force the evaluation in different ways). What am I doing wrong? If nothing -- how does vector
beat a simple for loop in C (that GCC further unrolls, btw)?
Please pardon my terribly long question, I was trying to give the whole context ;)