
I need to find the intersection of two sorted integer arrays and do it very fast.

Right now, I am using the following code:

int i = 0, j = 0;

while (i < arr1.Count && j < arr2.Count)
{
    if (arr1[i] < arr2[j])
    {
        i++;
    }
    else if (arr2[j] < arr1[i])
    {
        j++;
    }
    else
    {
        intersect.Add(arr2[j]);
        j++;
        i++;
    }
}

Unfortunately, it can take hours to do all the work.

How can I do it faster? I found an article where SIMD instructions are used. Is it possible to use SIMD in .NET?

What do you think about:

Mono.SIMD: http://docs.go-mono.com/index.aspx?link=N:Mono.Simd

NetASM (inject asm code into managed): http://netasm.codeplex.com/

and something like http://www.atrevido.net/blog/PermaLink.aspx?guid=ac03f447-d487-45a6-8119-dc4fa1e932e1


EDIT:

When I say thousands, I mean the following (in code):

for (var i = 0; i < arrCollection1.Count - 1; i++)
{
    for (var j = i + 1; j < arrCollection2.Count; j++)
    {
        Intersect(arrCollection1[i], arrCollection2[j]);
    }
}
Neir0
  • Don't you want to `break` from the loop after you have found the intersection? – Brendan Lesniak Jun 02 '12 at 23:42
  • @Brendan But how can I detect that moment? – Neir0 Jun 02 '12 at 23:45
  • 3
    Your title says "two" but your question says "thousands". Can you describe what you're trying to do? There might be a better way instead of comparing two at a time. – Mark Byers Jun 02 '12 at 23:46
  • Well what is considered `intersection`? The moment the value in the first array at location i is greater than the value in a second array? - if it is thousands, SIMD, might be the way to go – Brendan Lesniak Jun 02 '12 at 23:47
  • So your question is basically asking us to go read the off-site pages and offer opinions on them for use in your code? – Ken White Jun 02 '12 at 23:47
  • Can you post a small portion of code showing the expected output – Jupaol Jun 02 '12 at 23:48
  • 2
    Maybe a HashSet is a better data structure – Lukasz Madon Jun 02 '12 at 23:50
  • @lukas I tried HashSet, but it worked slowly – Neir0 Jun 02 '12 at 23:52
  • HashSet has an intersection method, @lukas is on the money – Jesse Jun 02 '12 at 23:53
  • @diolemo average length 15-30 items. About one billion arrays total (it actually depends on the input data) – Neir0 Jun 02 '12 at 23:54
  • @Jesse yes, but it does not work well http://codebetter.com/patricksmacchia/2011/06/16/linq-intersect-2-7x-faster-with-hashset/ And I want much faster code. – Neir0 Jun 02 '12 at 23:55
  • @Neir0, an interesting read, thanks for the link. Some food for thought as I was considering using hashsets for one of my projects. – Jesse Jun 03 '12 at 00:00
  • @Jupaol Expected output? Well...intersection of two arrays...what do you mean? – Neir0 Jun 03 '12 at 00:06
  • @Ken White I did some research before asking the question and posted links which might be useful. – Neir0 Jun 03 '12 at 00:08
  • Your question says: "What do you think about" and then lists three off-site (not StackOverflow) links. That seems to me like you're asking people to go and read them and then tell you their opinion of those sites' methods. – Ken White Jun 03 '12 at 00:15
  • @Neir0, would you mind giving me your opinion about my answer? – Sebas Jun 03 '12 at 00:40
  • @Sebas Can you please post some code? I do not understand what you mean. – Neir0 Jun 03 '12 at 00:43
  • ok, before I do so, could you confirm that arr1 and arr2 in your example are simple ordered arrays of integers? – Sebas Jun 03 '12 at 00:46
  • Is there always one and only one intersection? – Sebas Jun 03 '12 at 01:07
  • @Sebas One intersection for each pair of arrays. – Neir0 Jun 03 '12 at 01:11
  • @Sebas And by intersection, he means a collection of elements (integers) that the two arrays share in common. (Correct me if I'm wrong, Neir0) – SimpleVar Jun 03 '12 at 01:28
  • If so I'm lost, since he said the arrays contain integers and have one and only one intersection (between a given pair of arrays) – Sebas Jun 03 '12 at 01:30
  • @Sebas As I understand it, he meant that each pair of arrays results in a single output collection of common values. By the way, make sure to put "@name" in comments to inform others of your comment. – SimpleVar Jun 03 '12 at 02:11
  • Clarification: can you have duplicates in your list? For instance, what is the intersection of [0;1;1;2] and [1;1;2;3]? Is it [1;2] or [1;1;2]? – Mathias Jun 03 '12 at 04:22
  • @Mathias No. [0;1;1;2] is an impossible list. Each array contains a unique set of items. – Neir0 Jun 03 '12 at 04:58

5 Answers


UPDATE

The fastest I got was 200 ms with arrays of size 10 million, using the unsafe version (last piece of code).

The test I did:

var arr1 = new int[10000000];
var arr2 = new int[10000000];

for (var i = 0; i < 10000000; i++)
{
    arr1[i] = i;
    arr2[i] = i * 2;
}

var sw = Stopwatch.StartNew();

var result = arr1.IntersectSorted(arr2);

sw.Stop();

Console.WriteLine(sw.Elapsed); // 00:00:00.1926156

Full Post:

I've tested various ways to do it and found this to be very good:

public static List<int> IntersectSorted(this int[] source, int[] target)
{
    // Set initial capacity to a "full-intersection" size
    // This prevents multiple re-allocations
    var ints = new List<int>(Math.Min(source.Length, target.Length));

    var i = 0;
    var j = 0;

    while (i < source.Length && j < target.Length)
    {
        // Compare only once and let compiler optimize the switch-case
        switch (source[i].CompareTo(target[j]))
        {
            case -1:
                i++;

                // Saves us a JMP instruction
                continue;
            case 1:
                j++;

                // Saves us a JMP instruction
                continue;
            default:
                ints.Add(source[i++]);
                j++;

                // Saves us a JMP instruction
                continue;
        }
    }

    // Free unused memory (sets capacity to actual count)
    ints.TrimExcess();

    return ints;
}

For further improvement you can remove the ints.TrimExcess(); call, which will also make a nice difference, but you should consider whether you are going to need that memory.

Also, if you know that you might break out of loops that use the intersection, and you don't need the results as an array/list, you should change the implementation to an iterator:

public static IEnumerable<int> IntersectSorted(this int[] source, int[] target)
{
    var i = 0;
    var j = 0;

    while (i < source.Length && j < target.Length)
    {
        // Compare only once and let compiler optimize the switch-case
        switch (source[i].CompareTo(target[j]))
        {
            case -1:
                i++;

                // Saves us a JMP instruction
                continue;
            case 1:
                j++;

                // Saves us a JMP instruction
                continue;
            default:
                yield return source[i++];
                j++;

                // Saves us a JMP instruction
                continue;
        }
    }
}
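
For illustration only (this caller is not part of the original answer), here is a hypothetical consumer that stops early; with the iterator version above, the comparisons stop as soon as the enumeration stops:

// Hypothetical usage: stop after the first 10 common values.
var firstTen = new List<int>();

foreach (var value in arr1.IntersectSorted(arr2)) // iterator version above
{
    firstTen.Add(value);

    if (firstTen.Count == 10)
        break;
}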

Another improvement is to use unsafe code:

public static unsafe List<int> IntersectSorted(this int[] source, int[] target)
{
    var ints = new List<int>(Math.Min(source.Length, target.Length));

    fixed (int* ptSrc = source)
    {
        var maxSrcAdr = ptSrc + source.Length;

        fixed (int* ptTar = target)
        {
            var maxTarAdr = ptTar + target.Length;

            var currSrc = ptSrc;
            var currTar = ptTar;

            while (currSrc < maxSrcAdr && currTar < maxTarAdr)
            {
                switch ((*currSrc).CompareTo(*currTar))
                {
                    case -1:
                        currSrc++;
                        continue;
                    case 1:
                        currTar++;
                        continue;
                    default:
                        ints.Add(*currSrc);
                        currSrc++;
                        currTar++;
                        continue;
                }
            }
        }
    }

    ints.TrimExcess();
    return ints;
}

In summary, the biggest performance hit was in the if-elses. Turning them into a switch-case made a huge difference (about 2 times faster).

SimpleVar
  • var i = 0; is not the best place to use the var keyword, IMO. Could you post/link to the timings and tests? – Lukasz Madon Jun 03 '12 at 00:38
  • @lukas Added the test I've used in the beginning of the post. Can you tell me why var isn't suitable here? Are you talking about readability? – SimpleVar Jun 03 '12 at 00:49
  • I question your measurements... At least your first example has its timing highly dependent on the number of ints that matched. – Ilia G Jun 03 '12 at 00:52
  • @IliaG How come? Resizing the list from 10000000 to 100 doesn't differ from resizing it from 10000000 to 100000. I ran the test with arrays of all zeros as well, and with non-intersecting arrays too - same timing (and even better timing if the arrays are non-intersecting AND in different ranges) – SimpleVar Jun 03 '12 at 00:53
  • I don't know. Your sample code produces exactly 23 matches. Changing your `int` generating code to use `Random` creates highly variable results. – Ilia G Jun 03 '12 at 00:58
  • 1
    @IliaG It produces a list of 5000000, actually. But I will test it with randoms as well. Results are indeed different, but 330ms is still great! You can see the random-generated test [here](http://pastebin.com/UvuPpXAC). – SimpleVar Jun 03 '12 at 00:59
  • Thanks! Very good job, but I hope to find an approach which gives more than a 2x improvement – Neir0 Jun 03 '12 at 00:59
  • @Neir0 There is only so much one can improve in a small piece of code. For more improvement, you should work on your general algorithm and approach. – SimpleVar Jun 03 '12 at 01:04
  • @Yorye yes, it's less readable. Maybe adding `const` for locals that you don't mutate could help (unsafe versions). – Lukasz Madon Jun 03 '12 at 01:06
  • @lukas Are you referring to the max-variables? This is C#, remember. You can't just make things const - they have to be compile-time constants! And regarding the `var` thing - that's just a matter of preference. – SimpleVar Jun 03 '12 at 01:10
  • @Yorye I don't know, but sometimes in C++ a const pointer allows the compiler some small optimizations. – Lukasz Madon Jun 03 '12 at 01:16
  • @lukas True, but you can't do that in C#. – SimpleVar Jun 03 '12 at 01:18
  • @YoryeNathan Random values will not generate sorted arrays – anouar.bagari Jun 03 '12 at 01:32
  • @anouar204 We are talking about random+sorted. See the pastebin I linked to about 8 comments ago - I generate random lists, sort them, and then put them into arrays. – SimpleVar Jun 03 '12 at 01:35
  • @YoryeNathan I think the `j=i` is not working; perhaps it works in your example because your arrays contain the same data and therefore have the same length, so just incrementing with `j++; i++;` works just fine – Jupaol Jun 03 '12 at 01:48
  • @Jupaol You're right, I don't know what I was thinking. Fixed, thanks! – SimpleVar Jun 03 '12 at 01:50
  • I have two very large arrays; this algorithm really makes it faster, but the unsafe one gives me different values. Can you confirm it works the same as the safe one? – Kemal Can Kara Mar 29 '16 at 17:27
  • 1
    @KemalCanKara At a glance, I found a problem that `var maxTarAdr = ptTar + source.Length;` should actually be `var maxTarAdr = ptTar + target.Length;` -- oops! – SimpleVar Mar 29 '16 at 18:56

Have you tried something simple like this:

var a = Enumerable.Range(1, int.MaxValue/100).ToList();
var b = Enumerable.Range(50, int.MaxValue/100 - 50).ToList();

//var c = a.Intersect(b).ToList();
List<int> c = new List<int>();

var t1 = DateTime.Now;

foreach (var item in a)
{
    if (b.BinarySearch(item) >= 0)
        c.Add(item);
}

var t2 = DateTime.Now;

var tres = t2 - t1;

This piece of code takes one array of 21,474,836 elements and another with 21,474,786.

If I use var c = a.Intersect(b).ToList(); I get an OutOfMemoryException

The result product would be 461,167,507,485,096 iterations using nested foreach

But with this simple code, the intersection occurred in TotalSeconds = 7.3960529 (using one core)

Now I am still not happy, so I am trying to increase the performance by doing this in parallel; as soon as I finish I will post it.

Jupaol
  • Binary search gives O(m*lg(n)) while my approach is O(m+n). So binary search is good for very long arrays. In my case I have short arrays (15-30 elements) – Neir0 Jun 03 '12 at 00:51
  • Hmm, but you said thousands... at first. I just checked your edit... as you can see in my example, I am talking about an intersection between 20 million elements and another array of 20 million+, without out-of-memory exceptions and in a reasonable amount of time (7 sec) – Jupaol Jun 03 '12 at 00:55
  • +1. In case anyone is interested, this is indeed a good approach for computing the intersection between sorted arrays where m << n, but it can be significantly improved by re-using the insertion index returned by [BinarySearch](http://msdn.microsoft.com/en-us/library/a1s5syxa.aspx) as an argument back to it, to re-start the search from there instead of searching the entire set again (see the sketch after this comment). –  Aug 03 '13 at 02:57
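
A minimal sketch of that restart-the-search idea from the comment above (not part of the original answer; the class and method names and the choice of List<int> are illustrative):

using System.Collections.Generic;

public static class SortedIntersectSketch
{
    // Intersect two sorted lists by binary-searching the longer one,
    // restarting each search at the insertion index returned by the
    // previous search instead of searching the whole list again.
    public static List<int> IntersectByBinarySearch(List<int> shorter, List<int> longer)
    {
        var result = new List<int>();
        var lo = 0; // start of the remaining search window in `longer`

        foreach (var item in shorter)
        {
            if (lo >= longer.Count)
                break;

            var idx = longer.BinarySearch(lo, longer.Count - lo, item, null);

            if (idx >= 0)
            {
                result.Add(item);
                lo = idx + 1; // continue after the match
            }
            else
            {
                lo = ~idx;    // bitwise complement = insertion point
            }
        }

        return result;
    }
}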

Yorye Nathan gave me the fastest intersection of two arrays with the last "unsafe code" method. Unfortunately it was still too slow for me: I needed to compute combinations of array intersections, which goes up to 2^32 combinations - quite a lot, no? I made the following modifications and adjustments and the time dropped to 2.6x faster. You need to do some pre-optimization beforehand; surely you can do it one way or another. I am using only indexes instead of the actual objects or ids or some other abstract comparison. So, for example, if you have to intersect big numbers like these:

Arr1: 103344, 234566, 789900, 1947890
Arr2: 150034, 234566, 845465, 23849854

put everything into one combined array:

Combined: 103344, 234566, 789900, 1947890, 150034, 845465, 23849854

and, for the intersection, use the ordered indexes into the combined array:

Arr1Index: 0, 1, 2, 3
Arr2Index: 1, 4, 5, 6

Now we have smaller numbers with which we can build some other nice arrays. Starting from Yorye's method, I took Arr2Index and expanded it into what is conceptually a boolean array - in practice a byte array, because of the memory-size implications - like the following:

Arr2IndexCheck: 0, 1, 0, 0, 1, 1, 1

That is more or less a dictionary which tells me, for any index, whether the second array contains it. In the next step I avoided memory allocation, which also took time: instead, I pre-created the result array before calling the method, so while searching for my combinations I never instantiate anything. Of course you have to deal with the length of this array separately, so you may need to store it somewhere.
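
A minimal sketch of how such a check array might be built from Arr2Index (this helper is not part of the original answer; the name and the combined-length parameter are illustrative):

    // Hypothetical helper: expand the ordered index list of the second array
    // into a byte[] lookup where check[idx] == 1 means "the second array
    // contains the element at position idx of the combined array".
    public static byte[] BuildCheckArray(int[] arr2Index, int combinedLength)
    {
        var check = new byte[combinedLength];

        foreach (var idx in arr2Index)
        {
            check[idx] = 1;
        }

        return check;
    }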

Finally the code looks like this:

    public static unsafe int IntersectSorted2(int[] arr1, byte[] arr2Check, int[] result)
    {
        int length;

        fixed (int* pArr1 = arr1, pResult = result)
        fixed (byte* pArr2Check = arr2Check)
        {
            int* maxArr1Adr = pArr1 + arr1.Length;
            int* arr1Value = pArr1;
            int* resultValue = pResult;

            while (arr1Value < maxArr1Adr)
            {
                if (*(pArr2Check + *arr1Value) == 1)
                {
                    *resultValue = *arr1Value;
                    resultValue++;
                }

                arr1Value++;
            }

            length = (int)(resultValue - pResult);
        }

        return length;
    }

You can see that the result array size is returned by the function; then you do what you wish with the array (resize it, keep it). Obviously the result array has to be at least as large as the smaller of arr1 and arr2.

The big improvement is that I only iterate through the first array (which ideally should be smaller than the second one), so you have fewer iterations. Fewer iterations means fewer CPU cycles, right?

So here is a really fast intersection of two ordered arrays, in case you need really high performance ;).

Stefan Pintilie

Are arrCollection1 and arrCollection2 collections of arrays of integers? In that case you should get a notable improvement by starting the second loop from i+1 as opposed to 0.

Ilia G

C# doesn't support SIMD. Additionally, and I haven't yet figured out why, DLLs that use SSE aren't any faster when called from C# than the equivalent non-SSE functions. Also, all the SIMD extensions that I know of don't work with branching anyway, i.e. your "if" statements.

If you're using .NET 4.0, you can use Parallel.For to gain speed if you have multiple cores. Otherwise you can write a multithreaded version if you have .NET 3.5 or less.
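
For example, here is a minimal sketch of parallelizing the outer loop from your edit with Parallel.For. It assumes arrCollection1/arrCollection2 are List<int[]> and that a pairwise intersect routine is passed in; the class name, driver method, and result layout are illustrative, not from the original post:

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    static class ParallelPairwiseIntersect
    {
        // Hypothetical driver: intersect every (i, j) pair in parallel.
        // Each outer index i writes only to its own slot, so no locking is needed.
        static List<int>[][] IntersectAllPairs(
            List<int[]> arrCollection1,
            List<int[]> arrCollection2,
            Func<int[], int[], List<int>> intersect)
        {
            var results = new List<int>[arrCollection1.Count][];

            Parallel.For(0, arrCollection1.Count, i =>
            {
                results[i] = new List<int>[arrCollection2.Count];

                for (var j = i + 1; j < arrCollection2.Count; j++)
                {
                    results[i][j] = intersect(arrCollection1[i], arrCollection2[j]);
                }
            });

            return results;
        }
    }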

Here is a method similar to yours:

    IList<int> intersect(int[] arr1, int[] arr2)
    {
        IList<int> intersect = new List<int>();
        int i = 0, j = 0;
        int iMax = arr1.Length - 1, jMax = arr2.Length - 1;
        while (i < iMax && j < jMax)
        {
            while (i < iMax && arr1[i] < arr2[j]) i++;
            if (arr1[i] == arr2[j]) intersect.Add(arr1[i]);
            while (i < iMax && arr1[i] == arr2[j]) i++; //prevent reduntant entries
            while (j < jMax && arr2[j] < arr1[i]) j++;
            if (arr1[i] == arr2[j]) intersect.Add(arr1[i]);
            while (j < jMax && arr2[j] == arr1[i]) j++; //prevent redundant entries
        }
        return intersect;
    }

This one also prevents any entry from appearing twice. With two sorted arrays, both of size 10 million, it completed in about a second. The compiler is supposed to remove array bounds checks if you use array.Length in a for statement; I don't know whether that works in a while statement, though.

HypnoToad
  • Actually I already use the TPL (not in the intersect function, but when looping over all the arrays). Why are "DLLs that use SSE not any faster when called from C# than the non-SSE equivalent functions"? – Neir0 Jun 03 '12 at 00:24
  • That's great for Intersect + Distinct, but if he only wants the intersection then 1 second for 10 million elements is pretty slow. My implementation does that in 150ms (tested with 2 arrays, each 10 million elements, the first being {0,1,2,3,...} and the second being {0,2,4,6,8,...}). – SimpleVar Jun 03 '12 at 00:26
  • I haven't figured out why the DLL was slower when called from C#, but all other things being equal, the SSE calls were about 3x faster in C++ and about the same speed when called from C#. It may have something to do with pinvoke, but the datasets were very large. YMMV. – HypnoToad Jun 03 '12 at 00:36
  • Yorye if you have a faster method you should post it. – HypnoToad Jun 03 '12 at 00:36
  • @Doctor Zero OK, I understand. I can pass not just 2 arrays to the external function but all my arrays and do all the work in the external DLL. So the cost of one P/Invoke call will be low. – Neir0 Jun 03 '12 at 00:41
  • @DoctorZero I did. You can see it above, currently sorted as first answer. – SimpleVar Jun 03 '12 at 01:25
  • @Neir0, regarding the SSE calls. I would be very curious to know if you get the same performance boost from C# as you get from C++. When I called SSE functions from C# (the loops were inside the functions, of course) I saw no performance gain. Never figured out why. – HypnoToad Jun 03 '12 at 02:13