37

I have a video processing application that moves a lot of data.

To speed things up, I have built a lookup table, since many of the calculations only need to be done once and can then be reused.

However, I'm at the point where the lookups now take 30% of the processing time. I wonder whether slow RAM might be the cause. Either way, I would still like to optimize this further.

Currently I have the following:

public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];

I then perform a lookup with an index p (equal to width * y + x) to fetch the result.

int[] newResults = new int[width*height];
int p = 0;
for (int y = 0; y < height; y++) {
   for (int x = 0; x < width; x++, p++) {
      newResults[p] = largeArray[lookUp[p]];
   }
}

Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.

I made some progress by shortening the call stack: no getters, just direct reads from a readonly array.

I've tried converting to ushort as well, but it seemed to be slower (as I understand it, due to word size).

Would an IntPtr be faster? How would I go about that?

Attached below is a screenshot of time distribution:

[Screenshot: profiler time distribution]

S.S. Anne
RobotRock
  • It'll be slightly faster to only have one `for` that goes to `height*width`, but other than that it's hard to see any obvious optimization opportunities. Is there any pattern to how the indexes in `lookUp` are distributed? – 500 - Internal Server Error Sep 18 '19 at 11:17
  • The indexes are distributed from equirectangular to cubic mapping (and back). A lookup would seem to be the fastest approach, as the alternative was performing a lot of multiplications, cos, sin calculations – RobotRock Sep 18 '19 at 11:22
  • I've converted to a single loop, but it isn't the issue. I've added a screenshot with a more elaborate view of the code with timing from the profiler. – RobotRock Sep 18 '19 at 11:28
  • @RobotRock: Off-topic: what tool are you using to display the time distribution? – Postlagerkarte Sep 18 '19 at 11:31
  • @Postlagerkarte It's VS2017 Diagnostics Tool – RobotRock Sep 18 '19 at 11:32
  • @jdweng That depends on what you are doing. If the array index is the key, then accessing the element directly by index is the fastest option. But I suspect a more involved lookup is being performed, which I haven't quite understood yet, hence my asking for more info on what is being looked up. If it were just `var thing = largeArray[index]`, fine, but I have a strong feeling it's more involved than that, which is why I want a little more clarity. – Seabizkit Sep 18 '19 at 11:41
  • A small LUT with a refinement step might give you the middle ground. cos and sine are smooth functions so a simple starting-point + slope for your LUT entries would give you a linear approximation. Or just use linear interpolation between LUT entries. Also, cos = slope of sin = derivative. – Peter Cordes Sep 18 '19 at 22:11
  • This question belongs on [Code Review](https://codereview.stackexchange.com). – aschipfl Sep 19 '19 at 07:16
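
A minimal sketch of the small-LUT-plus-interpolation idea raised in the comments. The `SinLut` class name and the 1024-entry table size are assumptions made for illustration; neither comes from the thread. Whether this beats a direct index lookup for the equirectangular/cubic mapping would have to be measured.

```csharp
using System;

// Hypothetical helper (not from the thread): a small sine table
// refined by linear interpolation between neighbouring entries.
static class SinLut
{
    const int Size = 1024;                 // assumed table resolution
    const double TwoPi = 2.0 * Math.PI;
    static readonly float[] Table = BuildTable();

    static float[] BuildTable()
    {
        // One extra entry so Table[i + 1] is always a valid read.
        var t = new float[Size + 1];
        for (int i = 0; i <= Size; i++)
            t[i] = (float)Math.Sin(TwoPi * i / Size);
        return t;
    }

    public static float Sin(double radians)
    {
        // Reduce the angle to a fraction of a full turn in [0, 1].
        double turns = radians / TwoPi;
        turns -= Math.Floor(turns);

        // Fractional position in the table.
        double pos = turns * Size;
        int i = (int)pos;
        if (i >= Size) i = Size - 1;       // guard against rounding at the top edge
        float frac = (float)(pos - i);

        // Start value plus slope times offset (linear interpolation).
        return Table[i] + (Table[i + 1] - Table[i]) * frac;
    }

    // cos(x) = sin(x + pi/2), so the same table serves both functions.
    public static float Cos(double radians) => Sin(radians + Math.PI / 2);
}
```

A call such as `SinLut.Sin(angle)` replaces `Math.Sin(angle)`. The table here is 1025 floats (about 4 KB), small enough to stay in L1 cache, which is the point of trading a large lookup for a small one plus a little arithmetic.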

2 Answers

58

It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular `VPGATHER**`. This is exposed in .NET Core 3, and should work something like the code below, which covers the single-loop scenario (you can probably work from here to get the double-loop version).

results first:

AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms

AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms

code:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class P
{
    static int Gather(int[] source, int[] index, int[] results, bool avx)
    {   // normally you wouldn't have avx as a parameter; that is just so
        // I can turn it off and on for the test; likewise the "int" return
        // here is so I can monitor (in the test) how much we did in the "old"
        // loop, vs AVX2; in real code this would be void return

        int y = 0;
        if (Avx2.IsSupported && avx)
        {
            var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
            var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);

            unsafe
            {
                fixed (int* sPtr = source)
                {
                    // note: here I'm assuming we are trying to fill "results" in
                    // a single outer loop; for a double-loop, you'll probably need
                    // to slice the spans
                    for (int i = 0; i < rv.Length; i++)
                    {
                        rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);
                    }
                }
            }
            // move past everything we've processed via SIMD
            y += rv.Length * Vector256<int>.Count;
        }
        // now do anything left, which includes anything not aligned to 256 bits,
        // plus the "no AVX2" scenario
        int result = y;
        int end = results.Length; // hoist, since this is not the JIT recognized pattern
        for (; y < end; y++)
        {
            results[y] = source[index[y]];
        }
        return result;
    }

    static void Main()
    {
        // invent some random data
        var rand = new Random(12345);
        int size = 1024 * 512;
        int[] data = new int[size];
        for (int i = 0; i < data.Length; i++)
            data[i] = rand.Next(255);

        // build a fake index
        int[] index = new int[1024];
        for (int i = 0; i < index.Length; i++)
            index[i] = rand.Next(size);

        int[] results = new int[1024];

        void GatherLocal(bool avx)
        {
            // prove that we're getting the same data
            Array.Clear(results, 0, results.Length);
            int from = Gather(data, index, results, avx);
            Console.WriteLine($"AVX enabled: {avx}; slow loop from {from}");
            for (int i = 0; i < 32; i++)
            {
                Console.Write(results[i].ToString("x2"));
            }
            Console.WriteLine();

            const int TimeLoop = 1024 * 512;
            var watch = Stopwatch.StartNew();
            for (int i = 0; i < TimeLoop; i++)
                Gather(data, index, results, avx);
            watch.Stop();
            Console.WriteLine($"for {TimeLoop} loops: {watch.ElapsedMilliseconds}ms");
            Console.WriteLine();
        }
        GatherLocal(false);
        if (Avx2.IsSupported) GatherLocal(true);
    }
}
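
For the double-loop case, one way to "slice the spans" is to take a per-row slice of `index` and `results` and hand each pair to a per-row routine. The sketch below is only an illustration: `RowGather`, `GatherRow` and `GatherRows` are hypothetical names, and the per-row routine is a plain scalar loop so that the example needs no `unsafe` code.

```csharp
using System;

// Hypothetical names; a sketch of row-by-row span slicing only.
static class RowGather
{
    // Scalar per-row kernel: resultRow[x] = source[indexRow[x]].
    static void GatherRow(ReadOnlySpan<int> source,
                          ReadOnlySpan<int> indexRow,
                          Span<int> resultRow)
    {
        for (int x = 0; x < resultRow.Length; x++)
            resultRow[x] = source[indexRow[x]];
    }

    // Outer loop over rows; each row is a slice of 'width' elements
    // starting at y * width, matching p = width * y + x.
    public static void GatherRows(int[] source, int[] index, int[] results,
                                  int width, int height)
    {
        for (int y = 0; y < height; y++)
        {
            int start = y * width;
            GatherRow(source,
                      index.AsSpan(start, width),
                      results.AsSpan(start, width));
        }
    }
}
```

The AVX2 loop from `Gather` could take the place of the scalar kernel by applying `MemoryMarshal.Cast` to each row slice. A scalar tail is then needed per row whenever `width` is not a multiple of `Vector256<int>.Count` (8).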
Marc Gravell
  • Awesome! I think that's what I need! I'm gonna give it a go. – RobotRock Sep 18 '19 at 12:12
  • @RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly – Marc Gravell Sep 18 '19 at 12:13
  • Single loop is fine for me too. The double was more legacy than anything else. – RobotRock Sep 18 '19 at 12:14
  • @RobotRock note: still working on it, but it is possible that `scale` should be `4`, not `1` – Marc Gravell Sep 18 '19 at 12:35
  • No worries, it will take me a while to read up on using Avx2 with C# and the gather instruction anyway. At least I know where to look. Thanks. – RobotRock Sep 18 '19 at 12:47
  • @RobotRock some slightly bad news - my bad; fixed – Marc Gravell Sep 18 '19 at 12:49
  • @RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer – Marc Gravell Sep 18 '19 at 12:51
  • Can multiple threads running on different cores help? Or is the RAM the bottleneck? – ca9163d9 Sep 20 '19 at 22:13
  • @ca9163d9 that would be a great thing to measure! For avx2, I suspect it may help a little, not a lot. For avx512 (not currently exposed, plus few machines support it): I wouldn't expect it to help, since the avx512 itself can be pretty dominant – Marc Gravell Sep 21 '19 at 04:03
-2

RAM is already among the fastest storage available; only the CPU caches are faster. So the code will be memory-bound, but that is still plenty fast.

Of course, at the given sizes, this array has 6 million entries. That will likely not fit in any cache, and it will take a long time to iterate over. It does not matter what the speed is; this is simply too much data.

As a general rule, video processing is done on the GPU nowadays. GPUs are literally designed to operate on giant arrays, because that is what the image you are seeing right now is: a giant array.

If you have to keep it on the CPU side, maybe caching or lazy initialisation would help? Chances are that you do not truly need every value, only the common ones. Take an example from dice rolling: if you roll two 6-sided dice, every result from 2 to 12 is possible. But the result 7 occurs in 6 out of 36 cases, while 2 and 12 occur in only 1 out of 36 cases each. So having the 7 stored is a lot more beneficial than having the 2 and 12 stored.

Christopher
  • while a GPU might be nice here, modern CPUs *also* have operations specifically to support this scenario - this is a "gather" (see answer above) – Marc Gravell Sep 18 '19 at 12:33