Surprisingly low times when benchmarking Data.Vector

Question

I am benchmarking Haskell's array libraries (the array and vector packages) to come up with the best way of storing large data for my use case. I am using criterion as the benchmarking tool.

Long story short: my code simply allocates a vector and proceeds to fill it with simple structs (1M, 10M, and 100M elements). When I compare the Haskell benchmark times with a simple reference implementation I wrote in C, Haskell is a few times faster, which I find suspicious: the C code is a simple loop filling the structs in the array.

The question: is it possible for Haskell's vector library to beat C in terms of performance? Or does it mean my benchmarks are flawed/something is not actually evaluated/there's some 'gotcha'?

Another question: how can I make sure that the Haskell vectors are actually evaluated?

Longer explanation: The task at hand is to fill a vector with a large number of structs. They have Storable instances and the vector used is Data.Vector.Storable.

The data type is the following:

data Foo = Foo Int Int deriving (Show, Eq, Generic, NFData)

And the Storable instances look like this:

chunkSize :: Int
chunkSize = sizeOf (undefined :: Int)
{-# INLINE chunkSize #-}

instance Storable Foo where
    sizeOf    _ = 2 * chunkSize ; {-# INLINE sizeOf    #-}
    alignment _ = chunkSize     ; {-# INLINE alignment #-}
    peek ptr = Foo
        <$> peekByteOff ptr 0
        <*> peekByteOff ptr chunkSize
    {-# INLINE peek #-}
    poke ptr (Foo a b) = do
        pokeByteOff ptr 0 a
        pokeByteOff ptr chunkSize b
    {-# INLINE poke #-}
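
For comparison with the C side, the byte layout this instance describes (two Int-sized fields at offsets 0 and chunkSize) corresponds to a struct like the one below. This is only a sketch: it assumes a 64-bit GHC where Int is 8 bytes, borrows the field names name and id from the C code, and fooFromBytes is a hypothetical helper that is not part of the gist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 64-bit GHC, where Int is 8 bytes, so chunkSize == 8.
   Field names are borrowed from the C code in the question. */
typedef struct {
    int64_t name; /* the field poke writes at byte offset 0 */
    int64_t id;   /* the field poke writes at byte offset chunkSize */
} Foo;

/* Compile-time checks mirroring sizeOf and the second field's offset. */
_Static_assert(sizeof(Foo) == 2 * sizeof(int64_t), "sizeOf = 2 * chunkSize");
_Static_assert(offsetof(Foo, id) == sizeof(int64_t), "second field at chunkSize");

/* Hypothetical helper: rebuild a Foo from raw bytes, much as peek does. */
static Foo fooFromBytes(const unsigned char *bytes) {
    Foo f;
    memcpy(&f, bytes, sizeof f);
    return f;
}
```

If both static assertions hold, a buffer filled by the Haskell poke should be readable as an array of this struct on the same machine, which is what makes the two benchmarks comparable in the first place.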

The serialization itself seems to work fine. The vector is then allocated:

mkFooVec :: Int -> IO (Vector Foo)
mkFooVec !i = unsafeFreeze =<< new (i + 1)

And populated with the structs:

populateFooVec :: Int -> Vector Foo -> IO (Vector Foo)
populateFooVec !i !v = do
    v' <- unsafeThaw v
    let go 0 = return ()
        go j = unsafeWrite v' j (Foo j $ j + 1) >> go (j - 1)
    go i
    unsafeFreeze v'

The benchmark is a standard criterion one:

    defaultMain [
      bgroup "Storable vector (mutable)"
        $ (\(i :: Int) -> env (mkFooVec (10 ^ i))
        $ \v -> bench ("10e" <> show i)
        $ nfIO (populateFooVec (10 ^ i) v))  <$> [6..8]
    ]

The gist contains other benchmarks, trying to force evaluation in different ways.

Reference C code doing more or less the same can be found here (gist). The main logic is the following:

Foo *allocFoos(long n) {
    return (Foo *) malloc(n * sizeof(Foo));
}

// populate the array with structs:
void createFoos(Foo *v, long n) {
    for (long i = 0; i < n; ++i) {
        v[i].name = i;
        v[i].id = i + 1;
    }
}

And the command used to run it: gcc -O2 -o bench benchmark.c && ./bench
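
One way to rule out the "nothing was actually computed" problem on the C side is to read the array back and compare against a known value. The following is a minimal sketch, assuming Foo has two long fields (the question shows only the field names); sumIds and fillAndCheck are hypothetical helpers, not functions from the gist.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed field types: the question only shows the names. */
typedef struct {
    long name;
    long id;
} Foo;

/* Same loop as in the question. */
static void createFoos(Foo *v, long n) {
    for (long i = 0; i < n; ++i) {
        v[i].name = i;
        v[i].id = i + 1;
    }
}

/* Hypothetical helper: add up every id field in the array. */
static long long sumIds(const Foo *v, long n) {
    long long total = 0;
    for (long i = 0; i < n; ++i) {
        total += v[i].id;
    }
    return total;
}

/* Hypothetical helper: fill n elements, then compare the sum of ids with
   the closed form 1 + 2 + ... + n = n(n+1)/2.
   Returns 1 on a match, 0 on a mismatch, bad n, or allocation failure. */
static int fillAndCheck(long n) {
    if (n <= 0) return 0;
    Foo *v = malloc((size_t)n * sizeof(Foo));
    if (v == NULL) return 0;
    createFoos(v, n);
    long long got = sumIds(v, n);
    free(v);
    return got == (long long)n * (n + 1) / 2;
}
```

Printing or returning the checksum also gives the compiler a result that depends on every store, so the fill loop cannot simply be discarded as unused.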

Now when I run the benchmarks, the C code takes about 50ms, while Criterion reports results around 800 picoseconds (!). This makes me wonder: maybe I'm interpreting the results wrong? Maybe the vector isn't actually evaluated (although if you look at the Haskell gist, I try to force the evaluation in different ways). What am I doing wrong? If nothing -- how does vector beat a simple for loop in C (that GCC further unrolls, btw)?

Please pardon my terribly long question, I was trying to give the whole context ;)

Post the command line, how you compile your C code, probably no optimization flags there — Anton Malyshev, Feb 07 '18 at 14:28
Your C benchmark appears to be calling the function under test indirectly, via a pointer. This is unlikely to produce results reflective of the speed of calling the same function directly. — John Bollinger, Feb 07 '18 at 14:35
Your C benchmark is also inefficient in how it records the `Foo` values: it updates a local `Foo` with them, and then updates an array element from that, which copies each value twice. — John Bollinger, Feb 07 '18 at 14:38
I can't speak much to the Haskell, but are you sure the elements of your vector are stored *by value*, as the elements of your C array are? If they were instead stored *by reference* then that might well be faster to manipulate, and the semantics would be inequivalent to those of your C benchmark. — John Bollinger, Feb 07 '18 at 14:43
It's quite unlikely that it's actually populating a vector in a nanosecond. It seems quite likely it's optimizing your whole program away as unused in one way or another. — dfeuer, Feb 07 '18 at 14:44
Include the actual C code in your question, not just a link to it. Your question is meaningless without the code you're comparing. — Cubic, Feb 07 '18 at 14:47
The real question is: what makes you think you can get picosecond accuracy benchmarks on a bloody PC? — Lundin, Feb 07 '18 at 14:50
@dfeuer true. I simplified the code a bit to extract the minimal example. In my original code (with the structs more complex, but still nothing unusual) the timing was more reasonable. This is why I'm curious what is wrong in the haskell code/the benchmark. — piotrMocz, Feb 07 '18 at 14:55
@piotrMocz you can compare assembler outputs for gcc and the haskell compiler then, probably AVX is used there — Anton Malyshev, Feb 07 '18 at 15:10
@Lundin criterion uses pretty advanced statistical methods for its benchmarks - and it reports error terms, too. It certainly has the ability to benchmark sub-nanosecond timings. I would expect huge error terms in the cases where it does, though, and those weren't reported. — Carl, Feb 07 '18 at 17:04
It’s not *impossible* that ghc might deduce that a vector operation can be vectorized or parallelized when gcc can’t do the same for a translation into C. One aspect of C that’s showing its age is that you have to write all your vector operations as sequential loops that potentially can muck around with any element or the loop index itself in arbitrary ways, especially once pointer aliasing gets involved. Then, today, what you really want the compiler to do is figure out, *oh, you’re just performing the same operations on `i` or `a[i]` for each element,* and optimize to `forall` semantics. — Davislor, Feb 07 '18 at 17:44
But you’re right to be skeptical. You might try compiling with `-S` and checking the assembly code. But maybe the C version would benefit from a `#pragma omp` directive. — Davislor, Feb 07 '18 at 17:46
These unsafe thaws and freezes are all violating the guarantee you are (implicitly) giving by making the call. I wouldn't trust the code any further than I can throw my mainframe, let alone for benchmark numbers. Want to use a mutable vector? Then use a mutable vector, stop pretending an immutable vector is mutable with all sorts of dishonest calls. — Thomas M. DuBuisson, Feb 07 '18 at 18:01
@Davislor As far as I could tell, the Haskell-generated assembly didn't contain any vectorized instructions. Still, even if it used vectors heavily, the picosecond times suggest it's actually a problem of not evaluating the computation. I will try to produce a better example tomorrow. — piotrMocz, Feb 07 '18 at 23:17
Yeah. You could see if a `seq` or `$!` in the right place fixes it? — Davislor, Feb 07 '18 at 23:20

score 1 · Accepted Answer · answered Feb 08 '18 at 00:29

While I don't trust the benchmarking code, I also cannot reproduce the issue. I modified the Haskell gist (just removed the last two benchmarks) and the C benchmark (made it perform the operation 1000 times, then divided the total time by 1000).
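
The repeat-and-average change to the C benchmark could look roughly like this. It is a sketch rather than the exact code that was run: the Foo field types are assumed, and meanCreateFoosMs is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Assumed field types: the question only shows the names. */
typedef struct {
    long name;
    long id;
} Foo;

/* Same loop as in the question. */
static void createFoos(Foo *v, long n) {
    for (long i = 0; i < n; ++i) {
        v[i].name = i;
        v[i].id = i + 1;
    }
}

/* Hypothetical helper: run createFoos `reps` times on one buffer and
   return the mean CPU time per run in milliseconds.
   Returns -1.0 on bad arguments or allocation failure. */
static double meanCreateFoosMs(long n, int reps) {
    if (n <= 0 || reps <= 0) return -1.0;
    Foo *v = malloc((size_t)n * sizeof(Foo));
    if (v == NULL) return -1.0;

    clock_t start = clock();
    for (int r = 0; r < reps; ++r) {
        createFoos(v, n);
    }
    clock_t end = clock();

    /* Read one element back so the buffer's contents are used
       after the timed region. */
    volatile long sink = v[n - 1].id;
    (void)sink;

    free(v);
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Averaging over many runs smooths out timer resolution and one-off costs such as the first touch of freshly allocated pages. Every repetition writes the same values, so an aggressive optimizer is in principle free to simplify the repeated loop; checking the generated assembly remains a reasonable sanity step.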

EDIT: I don't trust the code because:

You are using unsafe* calls that have implicit contracts you violate.
The code doesn't even compile - you have a typo and a missing language extension. This is usually an indication of other shenanigans.

My Results

What is the result? Spot on, no oddities here.

% gcc bench.c -O3 && ./a.out
Starting the benchmark
[[ Malloced-array-[10000000] ]]Time taken: 11.904249 ms (cpu) 11.904249 ms (wall)
Done
./a.out  11.78s user 0.14s system 98% cpu 12.131 total

i.e. about 12 ms (11.9 ms) for C at 10^7 elements.

and

% ghc -O2 bench.hs && ./bench
benchmarking Storable vector (FAKE mutable)/10e6
time                 2.362 ms   (2.236 ms .. 2.561 ms)
                     0.953 R²   (0.909 R² .. 0.989 R²)
mean                 2.344 ms   (2.268 ms .. 2.482 ms)
std dev              305.0 μs   (169.1 μs .. 477.1 μs)
variance introduced by outliers: 79% (severely inflated)

benchmarking Storable vector (FAKE mutable)/10e7
time                 23.37 ms   (22.13 ms .. 24.73 ms)
                     0.989 R²   (0.979 R² .. 0.996 R²)
mean                 23.19 ms   (22.63 ms .. 23.76 ms)
std dev              1.287 ms   (1.015 ms .. 1.713 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking Storable vector (FAKE mutable)/10e8
time                 232.2 ms   (215.1 ms .. 247.3 ms)
                     0.994 R²   (0.974 R² .. 1.000 R²)
mean                 223.5 ms   (215.9 ms .. 231.5 ms)
std dev              10.41 ms   (7.887 ms .. 13.06 ms)
variance introduced by outliers: 14% (moderately inflated)

i.e. 23 ms for Haskell at 10^7 elements.

This is on a moderately new MacBook with GHC 8.2.
