I am trying to use Vector<int> to add the integer values of two arrays faster than a traditional for loop.

My Vector<int>.Count is 4, which should mean that addArrays_Vector runs about 4 times faster than addArrays_Normally:

var vectSize = Vector<int>.Count;

This is true on my computer:

Vector.IsHardwareAccelerated

However, strangely enough, these are the benchmark results:
addArrays_Normally takes 475 milliseconds
addArrays_Vector takes 627 milliseconds

How is this possible? Shouldn't addArrays_Vector take only about 120 milliseconds? Am I doing something wrong?

        void runVectorBenchmark()
        {
            var v1 = new int[92564080];
            var v2 = new int[92564080];
            for (int i = 0; i < v1.Length; i++)
            {
                v1[i] = 2;
                v2[i] = 2;
            }
            
            //new Thread(() => addArrays_Normally(v1, v2)).Start();
            new Thread(() => addArrays_Vector(v1, v2, Vector<int>.Count)).Start();
        }
        void addArrays_Normally(int[] v1, int[] v2)
        {
            Stopwatch stopWatch = new Stopwatch();
            stopWatch.Start();
            int sum = 0;
            int i = 0;
            for (i = 0; i < v1.Length; i++)
            {
                sum = v1[i] + v2[i];
            }
            stopWatch.Stop();
            MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds\n\n" );
        }
        void addArrays_Vector(int[] v1, int[] v2, int vectSize)
        {
            Stopwatch stopWatch = new Stopwatch();
            stopWatch.Start();
            int[] retVal = new int[v1.Length];
            int i = 0;
            for (i = 0; i < v1.Length - vectSize; i += vectSize)
            {
                var va = new Vector<int>(v1, i);
                var vb = new Vector<int>(v2, i);
                var vc = va + vb;
                vc.CopyTo(retVal, i);
            }
            stopWatch.Stop();
            MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds\n\n" );
        }
Andreas
  • Please don't use JavaScript snippet blocks for C# code. You can't run it in a browser, just use a normal code-formatting block. Also, did you build this with optimization enabled? Debug / anti-optimized builds often cause a "negative speedup" for manually vectorized code. – Peter Cordes Apr 09 '20 at 19:15
  • Or maybe your compiler auto-vectorized that simple loop. Or optimized away some of the work because `sum = v1[i] + v2[i];` isn't `+=`. The final result only depends on the final loop iteration! In fact the `sum` isn't even printed or returned so a compiler could completely optimize away the loop to nothing at all. – Peter Cordes Apr 09 '20 at 19:20

1 Answer

The two functions do different work, and it looks like memory bandwidth is the bottleneck here:

  • in the first example

        var v1 = new int[92564080];
        var v2 = new int[92564080];
    
        ...
    
        int sum = 0;
        int i = 0;
        for (i = 0; i < v1.Length; i++)
        {
            sum = v1[i] + v2[i];
        }
    

The code reads both arrays once, so the memory traffic is sizeof(int) * 92564080 * 2 == 4 * 92564080 * 2 bytes, which is about 706 MiB.

  • in the second example

        var v1 = new int[92564080];
        var v2 = new int[92564080];
    
        ...            
    
        int[] retVal = new int[v1.Length];
        int i = 0;
        for (i = 0; i < v1.Length - vectSize; i += vectSize)
        {
            var va = new Vector<int>(v1, i);
            var vb = new Vector<int>(v2, i);
            var vc = va + vb;
            vc.CopyTo(retVal, i);
        }
    

The code reads the 2 input arrays and writes into an output array, so the memory traffic is at least sizeof(int) * 92564080 * 3 bytes, which is about 1,059 MiB.

Update:

RAM is much slower than the CPU and its caches. Rough figures from this great article on Memory Bandwidth Napkin Math:

L1 Bandwidth: 210 GB/s

...

RAM Bandwidth: 45 GB/s

So the extra memory traffic would negate the vectorization speedup.
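As a rough sanity check on these numbers, the sketch below computes how many bytes each loop moves and the minimum time that would take at the quoted RAM bandwidth. The 45 GB/s figure is the article's and will differ per machine, and the class and method names are made up for illustration. The results are lower bounds only: they ignore caches and any other per-iteration cost, and the question's addArrays_Vector also allocates retVal inside the timed region.

```csharp
using System;

static class MemoryTrafficEstimate
{
    // Bytes moved when `arrays` int arrays of `length` elements are each touched once.
    public static long TrafficBytes(long length, int arrays) => sizeof(int) * length * arrays;

    // Minimum time in milliseconds to move `bytes` at `bytesPerSecond`.
    public static double MinMilliseconds(long bytes, double bytesPerSecond) =>
        bytes / bytesPerSecond * 1000.0;

    static void Main()
    {
        const long length = 92564080;      // array length from the question
        const double ramBandwidth = 45e9;  // assumed: 45 GB/s, the article's rough figure

        long readOnly = TrafficBytes(length, 2);   // addArrays_Normally: reads v1 and v2
        long readWrite = TrafficBytes(length, 3);  // addArrays_Vector: also writes retVal

        const double MiB = 1024.0 * 1024.0;
        Console.WriteLine($"2 arrays: {readOnly / MiB:F0} MiB, at least {MinMilliseconds(readOnly, ramBandwidth):F1} ms");
        Console.WriteLine($"3 arrays: {readWrite / MiB:F0} MiB, at least {MinMilliseconds(readWrite, ramBandwidth):F1} ms");
    }
}
```

The point of the comparison is the ratio: touching three arrays moves 1.5 times as many bytes as touching two, regardless of how the additions themselves are executed.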

Also, the YouTube video mentioned compares different code. The non-vectorized code from the video is as follows; it moves the same amount of memory as the vectorized code:

    int[] AddArrays_Simple(int[] v1, int[] v2)
    {
        int[] retVal = new int[v1.Length];
        for (int i = 0; i < v1.Length; i++)
        {
            retVal[i] = v1[i] + v2[i];
        }
        return retVal;
    }
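For a like-for-like measurement, both versions should write into an output array that is allocated outside the timed region. The following is a minimal sketch under those assumptions, not a rigorous benchmark: the names AddArraysBenchmark, AddSimple, AddVector and TimeMs are hypothetical, the array size is the 9256000 used in the comments, and it should be built in Release mode for the timings to mean anything. The vectorized loop uses `<=` so the last full vector is processed, and a scalar loop handles any leftover elements.

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

static class AddArraysBenchmark
{
    // Scalar baseline: stores every sum so the work cannot be optimized away.
    public static void AddSimple(int[] a, int[] b, int[] result)
    {
        for (int i = 0; i < a.Length; i++)
            result[i] = a[i] + b[i];
    }

    // Vectorized version: Vector<int>.Count elements per iteration.
    public static void AddVector(int[] a, int[] b, int[] result)
    {
        int width = Vector<int>.Count;
        int i = 0;
        // "<=" so that the last full vector is included.
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<int>(a, i);
            var vb = new Vector<int>(b, i);
            (va + vb).CopyTo(result, i);
        }
        // Scalar tail for lengths that are not a multiple of the vector width.
        for (; i < a.Length; i++)
            result[i] = a[i] + b[i];
    }

    static long TimeMs(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        const int n = 9256000;
        var a = new int[n];
        var b = new int[n];
        var result = new int[n]; // allocated once, outside the timed region
        for (int i = 0; i < n; i++) { a[i] = 2; b[i] = 2; }

        // Warm-up calls so JIT compilation and first-touch page faults are not timed.
        AddSimple(a, b, result);
        AddVector(a, b, result);

        Console.WriteLine($"IsHardwareAccelerated: {Vector.IsHardwareAccelerated}, Count: {Vector<int>.Count}");
        Console.WriteLine($"AddSimple: {TimeMs(() => AddSimple(a, b, result))} ms");
        Console.WriteLine($"AddVector: {TimeMs(() => AddVector(a, b, result))} ms");
    }
}
```

Even with this setup, the two timings may end up close to each other on arrays this large, since both loops move the same number of bytes through memory.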
Renat
  • I am not sure. I am following the examples from `https://www.youtube.com/watch?v=wPT6iu3MZP0` exactly, so the code should be correct? But I still don't understand why it doesn't go 4 times faster. How should I change the code? Even if I use only `925600` elements, `addArrays_Vector` is still a lot slower. (I do have 32 GB of RAM) – Andreas Apr 09 '20 at 17:10
  • Yes thank you, I forgot to put: `int[] retVal = new int[v1.Length];` as you showed there. But still now the `AddArrays_Simple` is showing: `50ms` and the `addArrays_Vector` shows `60ms`. I don't understand why the `addArrays_Vector` doesn't show close to 4 times faster as: `15ms`? (Using 9256000 items) – Andreas Apr 09 '20 at 17:43
  • `clrjit.dll is loaded which means RyuJIT is being used to compile all managed code` and I have put x64 as the Platform target but still there is no speed improvement. – Andreas Apr 09 '20 at 18:18
  • Also keep in mind that a store basically costs twice as much memory bandwidth because newly-touched cache lines have to be read-for-ownership before the store can commit, as well as bandwidth for dirty cache lines being evicted and written back. – Peter Cordes Apr 09 '20 at 19:23