
I've seen a few articles describing how `Vector<T>` is SIMD-enabled and is implemented using JIT intrinsics, so the compiler will correctly emit AVX/SSE/... instructions when using it, allowing much faster code than classic, linear loops (example here).

I decided to try to rewrite one of my methods to see if I could get some speedup, but so far I've failed: the vectorized code is running 3 times slower than the original, and I'm not exactly sure why. Here are two versions of a method that checks whether two `Span<float>` instances have, at each position, a pair of items that fall on the same side of a threshold value.

// Classic implementation
public static unsafe bool MatchElementwiseThreshold(this Span<float> x1, Span<float> x2, float threshold)
{
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
        for (int i = 0; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    return true;
}

// Vectorized
public static unsafe bool MatchElementwiseThresholdSIMD(this Span<float> x1, Span<float> x2, float threshold)
{
    // Setup the test vector
    int l = Vector<float>.Count;
    float* arr = stackalloc float[l];
    for (int i = 0; i < l; i++)
        arr[i] = threshold;
    Vector<float> cmp = Unsafe.Read<Vector<float>>(arr);
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
    {
        // Iterate in chunks
        int
            div = x1.Length / l,
            mod = x1.Length % l,
            i = 0,
            offset = 0;
        for (; i < div; i += 1, offset += l)
        {
            Vector<float>
                v1 = Unsafe.Read<Vector<float>>(px1 + offset),
                v1cmp = Vector.GreaterThan<float>(v1, cmp),
                v2 = Unsafe.Read<Vector<float>>(px2 + offset),
                v2cmp = Vector.GreaterThan<float>(v2, cmp);
            float*
                pcmp1 = (float*)Unsafe.AsPointer(ref v1cmp),
                pcmp2 = (float*)Unsafe.AsPointer(ref v2cmp);
            for (int j = 0; j < l; j++)
                if (pcmp1[j] == 0 != (pcmp2[j] == 0))
                    return false;
        }

        // Test the remaining items, if any
        if (mod == 0) return true;
        for (i = x1.Length - mod; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    }
    return true;
}

As I said, I've benchmarked both versions using BenchmarkDotNet, and the one using `Vector<T>` runs around 3 times slower than the other. I tried running the tests with spans of different lengths (from around 100 to over 2000), but the vectorized method remains consistently slower.

Am I missing something obvious here?

Thanks!

EDIT: the reason why I'm using unsafe code and trying to optimize this code as much as possible, without parallelizing it, is that this method is already being called from within a `Parallel.For` iteration.

Besides, being able to parallelize the code over multiple threads is generally not a good reason to leave the individual parallel tasks unoptimized.

Sergio0694
    Just speaking from my personal experience, I will go to another direction using [Parallel.For](https://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.for%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396) for multi-threading instead of going into unsafe code to speedup my code. – Gordon Jan 11 '18 at 02:01
    @Gordon I'm already using `Parallel.For`, this method will actually be called in each of those parallel iterations. – Sergio0694 Jan 11 '18 at 02:07
  • If you are really interested in performance, you might want to consider moving your code to C++, where you can use features that .NET [doesn't support](https://stackoverflow.com/a/10775820/585968) (at least from a coder's point of view) like SSE2 and beyond. Bridge it with C++/CLI or straight-up P/Invoke. Using `unsafe` and pointers excessively in C# is like fighting with the language –  Jan 11 '18 at 02:25
  • @MickyD I know, but I'm working on a C# .NET Standard lib and including C++ code really isn't an option. Plus, there's the fact that I'm taking the opportunity to learn the language better as well as how to push it as much as possible. Besides, my question here was more about the `Vector` APIs themselves, as at this point I'm honestly curious as to why the vectorized code is performing slower. – Sergio0694 Jan 11 '18 at 02:34
  • ...or you can look into GPGPU, that will outperform any feeble stuff in the CPU I suspect and is arguably more elegant –  Jan 11 '18 at 02:36
  • That's why mine was a _comment_ and not an _answer_. Wishing you well –  Jan 11 '18 at 02:36
  • @MickyD Pinvoke has huge overhead in starting up, I don't think it will speed up anything, C++/CLI may help but beware of marshalling. – Gordon Jan 11 '18 at 02:40
    @MickyD I'm already using GPU acceleration in my library, but there still is a CPU-only part which is used when a CUDA GPU is not available, which I'd like to optimize as much as possible. And I'm sorry if my previous comment came out in the wrong way, I didn't mean to sound rude or annoyed at all, I was just trying to explain why I was interested in optimizing this code that way - in fact your observations were 100% valid of course. – Sergio0694 Jan 11 '18 at 02:40
  • @Sergio0694 Well then you may be more experienced than most people out here. It's hard to find someone to answer your question about C#/C++ hybrid optimization problems, as few people have been interested in them in recent years. In my view, Parallel.For should handle/unwrap for loops; there should not be another for loop inside Parallel.For. Most of the time, C# can be written to perform as fast as C++ through clever design and use of C# features. – Gordon Jan 11 '18 at 02:47
    That's all fine good sir. Your's is a very exciting project. Wishing you well. :) –  Jan 11 '18 at 03:05
  • @Gordon: SIMD and thread-level parallelism are completely orthogonal. (Or on AMD Bulldozer-family where pairs of integer cores share a vector / FPU unit, only mostly orthogonal, not completely. But logically they're always orthogonal.) If your problem has the kind of parallelism that SIMD can exploit, you definitely want the CPU to be running SIMD instructions, however you go about making that happen. Sometimes you can hit a memory bottleneck with enough threads, but not all systems will have enough cores to saturate their memory bandwidth without SIMD. (e.g. dual core laptop) – Peter Cordes Jan 11 '18 at 05:26
  • @Sergio0694 Did https://stackoverflow.com/a/49164908/136675 solve your problem? – Paul Westcott Mar 11 '18 at 18:54
  • @PaulWestcott Hey, thanks for the answer, my bad, I had completely missed that notification. I'll try it out as soon as possible and I'll be happy to mark it as valid if it does indeed work in my case – Sergio0694 Mar 11 '18 at 19:19

3 Answers


I had the same problem. The solution was to uncheck the Prefer 32-bit option in the project properties.

SIMD is only enabled for 64-bit processes. So make sure your app either is targeting x64 directly or is compiled as Any CPU and not marked as 32-bit preferred. [Source]
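To confirm whether SIMD is actually enabled at runtime, you can check `Vector.IsHardwareAccelerated`. A minimal sketch (the printed vector width depends on your CPU and on the process bitness):

```csharp
using System;
using System.Numerics;

class SimdCheck
{
    static void Main()
    {
        // False in a 32-bit-preferred process; true when the JIT emits SSE/AVX code.
        Console.WriteLine($"Accelerated: {Vector.IsHardwareAccelerated}");

        // Floats per Vector<float>: typically 4 with SSE, 8 with AVX2.
        Console.WriteLine($"Vector<float>.Count: {Vector<float>.Count}");

        // Quick sanity check that the process is actually 64-bit.
        Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}");
    }
}
```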

Theodor Zoulias

**EDIT** After reading a blog post by Marc Gravell, I see that this can be achieved simply...

public static bool MatchElementwiseThresholdSIMD(ReadOnlySpan<float> x1, ReadOnlySpan<float> x2, float threshold)
{
    if (x1.Length != x2.Length) throw new ArgumentException("x1.Length != x2.Length");

    if (Vector.IsHardwareAccelerated)
    {
        var vx1 = x1.NonPortableCast<float, Vector<float>>();
        var vx2 = x2.NonPortableCast<float, Vector<float>>();

        var vthreshold = new Vector<float>(threshold);
        for (int i = 0; i < vx1.Length; ++i)
        {
            var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
            var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
            if (Vector.Xor(v1cmp, v2cmp) != Vector<int>.Zero)
                return false;
        }

        x1 = x1.Slice(Vector<float>.Count * vx1.Length);
        x2 = x2.Slice(Vector<float>.Count * vx2.Length);
    }

    for (var i = 0; i < x1.Length; i++)
        if (x1[i] > threshold != x2[i] > threshold)
            return false;

    return true;
}

Now this is not quite as quick as using arrays directly (if that's what you have), but it is still significantly faster than the non-SIMD version...
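For comparison, here is a sketch of what the array-based variant alluded to above might look like, using the `Vector<T>(T[] values, int index)` constructor to load vectors straight from the arrays (the method name and structure are mine, not from the answer):

```csharp
using System;
using System.Numerics;

static class ThresholdMatch
{
    // Hypothetical array-based variant of the span version above.
    public static bool MatchElementwiseThresholdSIMD(float[] x1, float[] x2, float threshold)
    {
        if (x1.Length != x2.Length) throw new ArgumentException("x1.Length != x2.Length");

        int i = 0;
        if (Vector.IsHardwareAccelerated)
        {
            var vthreshold = new Vector<float>(threshold);
            int lastBlock = x1.Length - Vector<float>.Count;
            for (; i <= lastBlock; i += Vector<float>.Count)
            {
                // GreaterThan yields all-ones/all-zeros masks; XOR is non-zero
                // wherever the two masks disagree.
                var v1cmp = Vector.GreaterThan(new Vector<float>(x1, i), vthreshold);
                var v2cmp = Vector.GreaterThan(new Vector<float>(x2, i), vthreshold);
                if (Vector.Xor(v1cmp, v2cmp) != Vector<int>.Zero)
                    return false;
            }
        }

        // Scalar tail for the remaining elements.
        for (; i < x1.Length; i++)
            if (x1[i] > threshold != x2[i] > threshold)
                return false;

        return true;
    }
}
```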

(Another edit...)

...and just for fun I thought I would see how well this stuff works when fully generic, and the answer is: very well... so you can write code like the following, and it is just as efficient as the type-specific version (well, except in the non-hardware-accelerated case, in which case it's a bit less than twice as slow - but not completely terrible...)

public static bool MatchElementwiseThreshold<T>(ReadOnlySpan<T> x1, ReadOnlySpan<T> x2, T threshold)
    where T : struct
{
    if (x1.Length != x2.Length)
        throw new ArgumentException("x1.Length != x2.Length");

    if (Vector.IsHardwareAccelerated)
    {
        var vx1 = x1.NonPortableCast<T, Vector<T>>();
        var vx2 = x2.NonPortableCast<T, Vector<T>>();

        var vthreshold = new Vector<T>(threshold);
        for (int i = 0; i < vx1.Length; ++i)
        {
            var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
            var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
            if (Vector.AsVectorInt32(Vector.Xor(v1cmp, v2cmp)) != Vector<int>.Zero)
                return false;
        }

        // slice them to handle the remaining elements
        x1 = x1.Slice(Vector<T>.Count * vx1.Length);
        x2 = x2.Slice(Vector<T>.Count * vx2.Length);
    }

    var comparer = System.Collections.Generic.Comparer<T>.Default;
    for (int i = 0; i < x1.Length; i++)
        if ((comparer.Compare(x1[i], threshold) > 0) != (comparer.Compare(x2[i], threshold) > 0))
            return false;

    return true;
}
Paul Westcott
  • "not quite as quick as using array's directly" are you saying that vectorizing still doesn't provide a speed boost? – Qwertie Jan 24 '20 at 00:24

A vector is just a vector. It doesn't claim or guarantee that SIMD extensions are used. Use

System.Numerics.Vector2

https://learn.microsoft.com/en-us/dotnet/standard/numerics#simd-enabled-vector-types
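A minimal example of the SIMD-enabled fixed-size types the link above describes (`Vector2` here; `Vector3`/`Vector4` work the same way):

```csharp
using System;
using System.Numerics;

class Vector2Demo
{
    static void Main()
    {
        var a = new Vector2(1f, 2f);
        var b = new Vector2(3f, 4f);

        // Element-wise operations map to SIMD instructions where available.
        Vector2 sum = a + b;            // (4, 6)
        float dot = Vector2.Dot(a, b);  // 1*3 + 2*4 = 11

        Console.WriteLine($"{sum} {dot}");
    }
}
```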

Uğur Gümüşhan