Suppose we have many short, sorted integer arrays, and we need to find every pair of arrays whose intersection contains at least a predefined number of elements (`minSupport`). The code below demonstrates what I want to do better than I can explain it in words. The problem is SPEED: my code is very slow. It takes about 15 seconds for 2000 arrays (on my slow machine). Of course I can implement my own intersection method and parallelize the code, but that gives only a very limited improvement. Execution time grows as N^2 or so, and for 500k arrays it already takes a very, very long time. So how can I rewrite the algorithm for better performance? I am not limited to C#; maybe the CPU or GPU has good special instructions for such a job.

Example:

Input:
1,3,7,8
2,3,8,10
3,10,11,12,13,14

minSupport = 1

Output:

1 and 2: 3, 8
1 and 3: 3
2 and 3: 3, 10

    var minSupport = 2;
    var random = new Random(DateTime.Now.Millisecond);

    // Numbers in each array are unique
    var sortedArrays = Enumerable.Range(0,2000)
    .Select(x => Enumerable.Range(0,30).Select(t => random.Next(1000)).Distinct()
    .ToList()).ToList();
    var result = new List<int[]>();
    var resultIntersection = new List<List<int>>();

    foreach (var array in sortedArrays)
    {
        array.Sort();
    }

    var sw = Stopwatch.StartNew();

    //****MAIN PART*****//

    for (int i = 0; i < sortedArrays.Count-1; i++)
    {
        for (int j = i+1; j < sortedArrays.Count; j++)
        {
            var intersect = sortedArrays[i].Intersect(sortedArrays[j]).ToList();
            if(intersect.Count>=minSupport)
            {
                result.Add( new []{i,j});
                resultIntersection.Add(intersect);
            }
        }
    }

    //*****************//

    sw.Stop();

    Console.WriteLine(sw.Elapsed);
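
For reference, the hand-written intersection mentioned above could look roughly like the sketch below. It assumes both inputs are ascending and contain unique values, as in the test data; the names `SortedIntersection` and `Intersect` are hypothetical and only used for illustration. It replaces the hashing done by LINQ's `Intersect` with a single linear merge.

```csharp
using System;
using System.Collections.Generic;

public static class SortedIntersection
{
    // Hypothetical helper: linear merge of two ascending lists of unique values.
    public static List<int> Intersect(IReadOnlyList<int> a, IReadOnlyList<int> b)
    {
        var common = new List<int>();
        int i = 0, j = 0;
        while (i < a.Count && j < b.Count)
        {
            if (a[i] < b[j]) i++;          // a[i] cannot occur in b any more
            else if (a[i] > b[j]) j++;     // b[j] cannot occur in a any more
            else { common.Add(a[i]); i++; j++; }
        }
        return common;
    }
}
```

In the main loop this would be called as `SortedIntersection.Intersect(sortedArrays[i], sortedArrays[j])`. The number of pairs that get compared stays the same, so this only reduces the cost per pair, which matches the limited improvement described above.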

EDIT:

Now it takes about 9 seconds versus 15 seconds with the old algorithm for 2000 arrays. Well... of course that is still not fast enough.

    //****MAIN PART*****//

    // The upper bound on the values an array can contain is known in advance
    var maxValue = 1000;

    var reverseIndexDict = new Dictionary<int,List<int>>();

    for (int i = 0; i < maxValue; i++)
    {
        reverseIndexDict[i] = new List<int>();
    }

    for (int i = 0; i < sortedArrays.Count; i++)
    {
        for (int j = 0; j < sortedArrays[i].Count; j++)
        {
            reverseIndexDict[sortedArrays[i][j]].Add(i);
        }
    }

    var tempArr = new List<int>();
    for (int i = 0; i < sortedArrays.Count; i++)
    {
        tempArr.Clear();
        for (int j = 0; j < sortedArrays[i].Count; j++)
        {
            tempArr.AddRange(reverseIndexDict[sortedArrays[i][j]]);
        }

        result.AddRange(tempArr.GroupBy(x => x).Where(x => x.Count()>=minSupport).Select(x => new[]{i,x.Key}).ToList());

    }

    result = result.Where(x => x[0]<x[1]).ToList(); // keep each pair once (i<j) and drop self-pairs


    for (int i = 0; i < result.Count; i++)
    {
        resultIntersection.Add(sortedArrays[result[i][0]].Intersect(sortedArrays[result[i][1]]).ToList());
    }



    //*****************//

EDIT:

Some improvement.

    //****MAIN PART*****//

    // The upper bound on the values an array can contain is known in advance
    var maxValue = 1000;

    var reverseIndexDict = new List<int>[maxValue];

    for (int i = 0; i < maxValue; i++)
    {
        reverseIndexDict[i] = new List<int>();
    }

    for (int i = 0; i < sortedArrays.Count; i++)
    {
        for (int j = 0; j < sortedArrays[i].Count; j++)
        {
            reverseIndexDict[sortedArrays[i][j]].Add(i);
        }
    }



    for (int i = 0; i < sortedArrays.Count; i++)
    {
        var tempArr = new Dictionary<int, List<int>>();

        for (int j = 0; j < sortedArrays[i].Count; j++)
        {
            var sortedArraysij = sortedArrays[i][j];


            for (int k = 0; k < reverseIndexDict[sortedArraysij].Count; k++)
            {
                if(!tempArr.ContainsKey(reverseIndexDict[sortedArraysij][k]))
                {
                    tempArr[reverseIndexDict[sortedArraysij][k]] = new[]{sortedArraysij}.ToList();
                }
                else
                {
                   tempArr[reverseIndexDict[sortedArraysij][k]].Add(sortedArrays[i][j]);
                }

            }
        }


        // tempArr maps another array's index to the values it shares with array i
        foreach (var pair in tempArr)
        {
            if(pair.Value.Count>=minSupport)
            {
                result.Add(new[]{i,pair.Key});
                resultIntersection.Add(pair.Value);
            }
        }

    }

    // and here we are filtering collections

    //*****************//
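
The inverted-index idea from the last edit can be condensed into the sketch below. It is only an illustration under stated assumptions: every value lies in `[0, maxValue)`, each array holds unique ascending values, and a compiler with tuple support (C# 7 or later) is available. `PairFinder` and `FindPairs` are hypothetical names, not part of the code above.

```csharp
using System;
using System.Collections.Generic;

public static class PairFinder
{
    // Hypothetical helper. Assumes values in [0, maxValue) and unique values per array.
    public static List<(int First, int Second, List<int> Common)> FindPairs(
        IReadOnlyList<IReadOnlyList<int>> arrays, int maxValue, int minSupport)
    {
        // postings[v] lists the indexes of the arrays that contain v.
        var postings = new List<int>[maxValue];
        for (int v = 0; v < maxValue; v++) postings[v] = new List<int>();
        for (int i = 0; i < arrays.Count; i++)
            foreach (int v in arrays[i]) postings[v].Add(i);

        var result = new List<(int First, int Second, List<int> Common)>();
        for (int i = 0; i < arrays.Count; i++)
        {
            // shared[k] collects the values that array i has in common with array k.
            var shared = new Dictionary<int, List<int>>();
            foreach (int v in arrays[i])
            {
                foreach (int k in postings[v])
                {
                    if (k <= i) continue; // report each pair once and skip array i itself
                    if (!shared.TryGetValue(k, out var values))
                        shared[k] = values = new List<int>();
                    values.Add(v);
                }
            }
            foreach (var pair in shared)
                if (pair.Value.Count >= minSupport)
                    result.Add((i, pair.Key, pair.Value));
        }
        return result;
    }
}
```

With this layout the work is proportional to the number of (array, array, shared value) combinations rather than to every pair of arrays, so pairs with nothing in common are never touched. How much that helps depends on how often values repeat across arrays.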
Neir0
  • Have you considered converting your lists to `HashSet`s? You appear to only use set-like calls to inspect your lists, so a set affords you all the functionality you need. `Intersect()` does this internally, but it's being called repeatedly for each list. **Edit**: I just realized your lists may contain duplicates and you could have those duplicates intersect, which would render different results if a set was used. – cheeken Jun 05 '12 at 01:09
  • See http://stackoverflow.com/questions/10866756/fast-intersection-of-two-sorted-integer-arrays – Ian Mercer Jun 05 '12 at 01:09
  • @Ian Mercer Thank you, that is another question of mine. After that question I understood that a fast intersection of TWO arrays is not enough and I really must use another approach. – Neir0 Jun 05 '12 at 01:17
  • @cheeken I thought about HashSet, but I don't know the right way to use it in my case. – Neir0 Jun 05 '12 at 01:20

1 Answer

There are two solutions:

  1. Suppose you have 3 sorted arrays and you have to find the intersection between them. Traverse the first array and, for each of its elements, run a binary search on the other two arrays. If both binary searches find the element, it belongs to the intersection, so increment the intersection counter.

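    from bisect import bisect_left

    def binarySearch(element, array):
        # True if the sorted array contains the element
        pos = bisect_left(array, element)
        return pos < len(array) and array[pos] == element

    result = []
    for element in Array1:
        status1 = binarySearch(element, Array2)
        status2 = binarySearch(element, Array3)
        if status1 and status2:
            # the element is present in all three arrays
            result.append(element)
            if len(result) == MAX_INTERSECTION:
                break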
    

    Time Complexity : N * M * Log(N),
    where,
    N = Number of element in the array
    M = Number of arrays

  2. This solution works only if the numbers in the arrays are positive integers. Find the maximum and the minimum over all elements of all the sorted arrays. Because the arrays are sorted, it is enough to look at the first and last element of each array. Let the greatest number be max and the lowest number be min. Create an array of size max - min + 1 and fill it with zeros. Suppose you have 3 arrays: traverse the first array and, for each element, go to the corresponding index (the element minus min) in the newly created array and increment the value there. For example:

    if the element is 5 in Array 1, then New_array[5 - min] += 1
    

    Traverse all three sorted arrays and perform the operation described above. At the end, traverse New_array and look for values equal to 3; the elements at those indexes are the intersection.

    Time Complexity : O(N) + O(N) + ... = O(M * N), i.e. O(N) for a fixed number of arrays
    Space Complexity : O(maximum_element - minimum_element)
    where,
    N = number of elements in the array,
    M = number of arrays.
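
A minimal C# sketch of the counting approach in solution 2 is shown below. It assumes at least one array, that every array is non-empty, ascending, and free of duplicates, and that the value range is small enough for the counter array to fit in memory. `CountingIntersection` and `CommonToAll` are hypothetical names. Note that it returns the values present in every array, not the pairwise intersections.

```csharp
using System;
using System.Collections.Generic;

public static class CountingIntersection
{
    // Hypothetical helper. Assumes non-empty, ascending arrays of unique integers.
    public static List<int> CommonToAll(IReadOnlyList<IReadOnlyList<int>> arrays)
    {
        // The arrays are sorted, so the first and last items give each array's min and max.
        int min = int.MaxValue, max = int.MinValue;
        foreach (var array in arrays)
        {
            min = Math.Min(min, array[0]);
            max = Math.Max(max, array[array.Count - 1]);
        }

        // counts[v - min] is the number of arrays that contain v.
        var counts = new int[max - min + 1];
        foreach (var array in arrays)
            foreach (int v in array) counts[v - min]++;

        // A value is common to all arrays when its counter equals the number of arrays.
        var common = new List<int>();
        for (int offset = 0; offset < counts.Length; offset++)
            if (counts[offset] == arrays.Count) common.Add(min + offset);
        return common;
    }
}
```

The counter array has `max - min + 1` slots, so the final scan adds a cost proportional to the value range on top of visiting every element once.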

dan-boa
  • Well... in case 1) N*M*Log(N) is only for one array. If I want to check all arrays it takes N*M*M*Log(N). In case 2) I can't understand how this solution works with more than 3 arrays. I can't use only one index array for all arrays. So this solution can't be O(N). – Neir0 Jun 05 '12 at 10:04
  • In case of solution 1) Log(N) + Log(N) + ... repeated (number of arrays - 1) times = M * Log(N); doing this for each element of the first sorted array gives N * M * Log(N) – dan-boa Jun 05 '12 at 10:32
  • @dan-boa Yes, for the first array. But I need to find the intersections of all arrays, not only the first. – Neir0 Jun 05 '12 at 10:38
  • In case of solution 2) we have to use one index array for all the arrays, as we increment each index as many times as there are arrays containing that number. Array1 = [0,1,3,5]; Array2 = [3,5,6]; Array3 = [2,3,4,5]; Array4 = [3,5]; then the new_array (length = 7) will be [1,1,1,4,1,4,1]. Therefore the complexity is M*N, where M is the number of arrays and N is the number of elements in an array – dan-boa Jun 05 '12 at 10:41
  • @Neir0 My mistake, I thought you needed to find a global intersection with a constraint on the count. You actually need all the possible combinations of intersections above the constraint. – dan-boa Jun 05 '12 at 10:49