Given the following code:

    public float[] weights;
    public void Input(Neuron[] neurons)
    {
        float output = 0;

        for (int i = 0; i < neurons.Length; i++)
            output += neurons[i].input * weights[i];
    }

Is it possible to perform all the calculations in a single operation? For example, something equivalent to 'neurons[0].input * weights[0] + neurons[1].input * weights[1]...'

Coming from this topic, How to sum up an array of integers in C#, there is a way to do simpler calculations, but the idea of my code is to iterate over the first array, multiply each element by the element at the same index in the second array, and add that to a running total.

Profiling shows that the line where the output is summed is very heavy on I/O and consumes 99% of my processing power. The stack should have enough memory for this, and I am not worried about stack overflow. I just want it to run faster for the moment (even if accuracy is sacrificed).

Eduard G
  • I'm having trouble grasping what you are looking for. There is a limited number of transistors on your device and their width is also fixed. So you might be looking in the wrong direction (unless you know before compilation that your sizes are fixed and small). What you probably want to do is look at what every other high-performance linear algebra library is doing: cache optimization, loop unrolling, vectorization (SIMD, AVX, ...). This type of computation is so common that it has been optimized for decades: [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) : Level 1: sdot – sascha Aug 12 '20 at 15:56
  • @sascha that's pretty much it. I searched for similar libraries but didn't find anything that would be of use, but now that you showed me BLAS I found this -> https://github.com/xianyi/OpenBLAS . What I am essentially trying to achieve is to calculate the dot product, and it seems the library might have something that can help me out. Hopefully there are performance benefits. Thanks for the tip, I'll try to implement it. – Eduard G Aug 12 '20 at 22:02
  • Have you considered [Parallel.For](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.for?view=netcore-3.1)? It's easy/fast to implement, and could perform faster by utilizing more threads/cores. Other solutions might give better performance overall (especially if all CPU cores are already busy during this calculation), but you can implement `Parallel.For` very quickly, and without a dependency on third-party libraries. – Sean Skelly Aug 12 '20 at 22:18
  • @SeanSkelly It is already multi-threaded on a higher level. I tried using `Parallel.For` there too, but running small operations like a simple multiplication in parallel results in too much context switching, which kills performance. In this case execution time went from 0.6s to 3s. P.S. Your comment made me realize another place I can call `Parallel.For` up the chain, which makes the code execute in 0.2s now. Thanks! – Eduard G Aug 12 '20 at 22:35
  • @sascha The OpenBLAS library isn't installing for Framework 4.7.2. NuGet throws an error. Do you know any other reputable library I can use? – Eduard G Aug 12 '20 at 22:35
  • Next I will try the following: https://numerics.mathdotnet.com/ – Eduard G Aug 12 '20 at 22:38
  • Math.NET is slower than the ordinary implementation. The dot product of two vectors executed serially performs about 30% slower, and there doesn't seem to be an async implementation. – Eduard G Aug 12 '20 at 23:09

1 Answer

I think you are looking for AVX in C#.

With it you can calculate several values with a single instruction.

That's SIMD for CPU cores. Take a look at this

Here is an example from the website:

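    // Requires: using System.Numerics;
    public static int[] SIMDArrayAddition(int[] lhs, int[] rhs)
    {
        // Number of ints that fit in one SIMD register on this machine.
        var simdLength = Vector<int>.Count;
        var result = new int[lhs.Length];
        var i = 0;

        // Add simdLength elements per iteration.
        for (i = 0; i <= lhs.Length - simdLength; i += simdLength)
        {
            var va = new Vector<int>(lhs, i);
            var vb = new Vector<int>(rhs, i);
            (va + vb).CopyTo(result, i);
        }

        // Handle the leftover elements that do not fill a whole vector.
        for (; i < lhs.Length; ++i)
        {
            result[i] = lhs[i] + rhs[i];
        }

        return result;
    }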

You can also combine it with the parallelism you already use.
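
The example above adds two arrays, while the question asks for a dot product. As a rough sketch of how the same `Vector<T>` pattern might be adapted, the code below multiplies and accumulates in SIMD lanes and sums the lanes at the end. The class name `DotProductSketch` and the method `Dot` are hypothetical, and it assumes the neuron inputs have already been copied into a plain `float[]` (the question stores them as `neurons[i].input`). On .NET Framework, `Vector<T>` is assumed to come from the System.Numerics.Vectors NuGet package.

```csharp
using System;
using System.Numerics;

public static class DotProductSketch
{
    // Hypothetical helper, not from the original answer:
    // computes sum(a[i] * b[i]) using Vector<float>.
    public static float Dot(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Arrays must have the same length.");

        int width = Vector<float>.Count;   // lanes per vector, hardware dependent
        var acc = Vector<float>.Zero;      // one running partial sum per lane
        int i = 0;

        // Multiply and accumulate 'width' elements per iteration.
        for (; i <= a.Length - width; i += width)
            acc += new Vector<float>(a, i) * new Vector<float>(b, i);

        // Collapse the per-lane partial sums into a single float.
        float sum = Vector.Dot(acc, Vector<float>.One);

        // Scalar loop for the elements that did not fill a whole vector.
        for (; i < a.Length; i++)
            sum += a[i] * b[i];

        return sum;
    }
}
```

Because the partial sums are added in a different order than in the scalar loop, the result can differ slightly in the last bits for floating-point inputs. Whether this is faster than the plain loop depends on the array lengths and on whether the JIT actually emits SIMD instructions, so it is worth measuring in a Release build.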

Oliver
  • Tried it, tried varying the implementation too so it will return a sum of the dot product. Even just running the code from that example WITH code optimizations results in several times worse performance than what I already have. – Eduard G Aug 13 '20 at 22:46
  • @EduardG Are you using at least .NET 4.6, and if not, which version? Does your CPU support AVX? – Oliver Aug 19 '20 at 09:45
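
One way to check the point raised in the last comment is to ask the runtime whether `Vector<T>` operations are hardware accelerated. The snippet below is a minimal sketch. The class name `SimdInfo` and the method `Describe` are hypothetical, while `Vector.IsHardwareAccelerated` and `Vector<float>.Count` are part of `System.Numerics`. If the property reports `false`, the vector code likely runs through a software fallback, which could explain results slower than the plain loop.

```csharp
using System;
using System.Numerics;

public static class SimdInfo
{
    // Hypothetical helper: reports whether the JIT uses SIMD registers
    // for Vector<T> and how many floats fit in one vector.
    public static string Describe()
    {
        return "IsHardwareAccelerated=" + Vector.IsHardwareAccelerated +
               ", Vector<float>.Count=" + Vector<float>.Count;
    }
}
```

Calling `Console.WriteLine(SimdInfo.Describe());` from a Release build, run without a debugger attached, should give the most representative answer, since debug builds are typically not optimized.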