Median of union of sorted arrays - what to do after recursion ends

Question

I apologize if this question does not belong here. My problem is not with the code but with the algorithm, so perhaps it is better suited to another site, but the good people of Stack Overflow have never let me down.

Here is the question:

Given two sorted arrays A and B with the same number of elements, say n, such that they share no elements and no element appears twice in the same array, find the median of the union of the arrays in logarithmic time.

Very important note: if the number of elements is odd, the median is the middle element. If it is even, the median is not the average of the two middle elements; it is defined as the smaller of the two.

Solution: The idea is quite simple. Since the arrays are sorted, we can find the median of A (call it med1) and the median of B (call it med2) in O(1). If med1 > med2, then the median of the union is either an element of A that is smaller than med1 or an element of B that is larger than med2, and the reverse if med2 > med1. So we throw away the redundant elements and repeat the process until A and B are sufficiently small, say 2 elements each, and then we just need to find the median of these 4 numbers. Since 4 is even, the median of 4 numbers is the second smallest, which can be found in O(1).

This is my code:

#include<stdio.h>
#include<stdlib.h>
#include<conio.h>
int *scan_array(int* array_length);
int second_min_four_numbers(int a,int b,int c,int d);
int first_question(int *arr1,int *arr2,int left1,int right1,int left2,int right2);
int main(void)
{
    int *arr1,*arr2,length_arr1=0,length_arr2=0;
    printf("For the first sorted array:\n");
    arr1=scan_array(&length_arr1);
    printf("\nFor the second sorted array, enter %d numbers:\n",length_arr1);
    arr2=scan_array(&length_arr2);
    if(length_arr1==1) //edge case, arrays are length one. return the min
    {
        if(arr1[0] > arr2[0])
            printf("The Median is %d",arr2[0]);
        else
            printf("The Median is %d",arr1[0]);
    }
    else
        printf("The Median is %d",first_question(arr1,arr2,0,length_arr1-1,0,length_arr2-1));
    getch();
}
int *scan_array(int* array_length) //nothing fancy. just scan the arrays.
{
    int* temp,temp_length,array_element,i=0,*real_array;
    temp=(int*)malloc(50*sizeof(int));
    printf("Enter positive numbers. To stop enter negative or zero.\nDon't enter more than 50 numbers\n");
    scanf("%d",&array_element);
    while(array_element>0)
    {
        (*array_length)++;
        temp[i]=array_element;
        i++;
        scanf("%d",&array_element);
    }
    real_array=(int*)malloc((*array_length)*sizeof(int));
    for(i=0;i<*array_length;i++)
        real_array[i]=temp[i];
    free(temp);
    return real_array;
}
int first_question(int *arr1,int *arr2,int left1,int right1,int left2,int right2) 
{
    int med1,med2;
    if(right1-left1+right2-left2 == 2) //we are done. reached 4 elements. we will always be here for arrays larger than 1 element each
        return second_min_four_numbers(arr1[left1],arr1[right1],arr2[left2],arr2[right2]);
    med1=arr1[(left1+right1)/2]; //not done. find the medians in O(1).
    med2=arr2[(left2+right2)/2];
    if(med1 < med2)//the median of the union is somewhere between them
        return first_question(arr1,arr2,(left1+right1)/2,right1,left2,(left2+right2)/2);
    else
        return first_question(arr1,arr2,left1,(left1+right1)/2,(left2+right2)/2,right2);
}
int second_min_four_numbers(int a,int b,int c,int d) //find second min between four numbers
{
    int min=0,second_min=0; //very crude, and inefficient but simple to understand and still O(1)
    min = a;
    if(min > b)
        min = b;
    if(min > c)
        min = c;
    if(min > d)
        min = d;
    if(a == min) 
    {
        second_min=b;
        if(second_min > c)
            second_min = c;
        if(second_min > d)
            second_min = d;
        return second_min;
    }
    if(b == min)
    {
        second_min=a;
        if(second_min > c)
            second_min=c;
        if(second_min > d)
            second_min = d;
        return second_min;
    }
    if(c == min)
    {
        second_min=a;
        if(second_min > b)
            second_min = b;
        if(second_min > d)
            second_min = d;
        return second_min;
    }
    if(d == min)
    {
        second_min=a;
        if(second_min > b)
            second_min=b;
        if(second_min > c)
            second_min=c;
        return second_min;
    }
    return second_min; //not reached: one of a, b, c, d always equals min
}

It is working as intended and compiles. As I said, the problem is not with my code, it's with the algorithm. Let's see an example that will demonstrate the problem:

Suppose our input was A=[1,3,5] and B=[2,4,6]. Then med1=3 and med2=4. Throw away the redundant elements and now we have A=[3,5] and B=[2,4]. Now we have only 4 elements overall, the data is sufficiently small, so just find the median of these 4 numbers [3,5,2,4]. The median would be 3, which is also the correct result for the median of the union of A and B, so the result is correct.

Now let's assume our input was A=[1,3,5,7] and B=[2,4,6,8]. Then med1=3 and med2=4. Throw away the redundant elements to get A=[3,5,7] and B=[2,4]. Now med1=5 and med2=2. Again throw away the redundant elements to get A=[3,5] and B=[2,4]. Now our data is sufficiently small, so find the median of [3,5,2,4], which again gives us 3. But that result is incorrect: 3 is not the median of the union of A and B. The correct result would be 4.
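To check expected values like the ones in these two examples, a linear-time reference can help. The following is only a sketch for verification, not the logarithmic solution being asked for; the name `lower_median_by_merge` is hypothetical, and it assumes both arrays are sorted, have length n >= 1, and contain distinct values.

```c
/* Reference check only: O(n), not the O(log n) algorithm in question.
 * Walks both sorted arrays as if merging them and returns the n-th
 * smallest element of the union, i.e. the smaller of the two middle
 * elements of the 2n values. */
int lower_median_by_merge(const int *a, const int *b, int n)
{
    int i = 0, j = 0, value = 0;
    for (int k = 0; k < n; k++) {
        /* Take the smaller head element; if b is exhausted, take from a. */
        if (j >= n || (i < n && a[i] < b[j]))
            value = a[i++];
        else
            value = b[j++];
    }
    return value;
}
```

For A=[1,3,5] and B=[2,4,6] this returns 3, and for A=[1,3,5,7] and B=[2,4,6,8] it returns 4, matching the correct medians stated above.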

How can we fix this problem?

As you say, the algorithm produces the wrong result for some -- indeed most -- inputs. As far as I can tell, there's no simple tweak that would fix the algorithm. You need a different approach. — John Bollinger, Apr 17 '15 at 19:24
I think it's just the end part that produces the problem. We will never remove the median of the union while in the recursion. The problem is finding it after the recursion ends. — Oria Gruber, Apr 17 '15 at 19:25
No, the algorithm is altogether wrong. For example, take A as the even integers from 2 through 20, and B the odd integers from 1 through 19. The correct median is 10, but that's completely out of consideration before you get to the final step. — John Bollinger, Apr 17 '15 at 19:27
I just had an idea! Given a number x, we can check in O(1) whether x is the median of the union! We could just stop when we have 4 elements and then check whether each of those elements is the median of the union! — Oria Gruber, Apr 17 '15 at 19:28
Why is it completely out of consideration? When we end the recursion, we would only need to consider the numbers [8,10,9,11] — Oria Gruber, Apr 17 '15 at 19:32
No, the even integers and odd integers are disjoint. A and B share no elements in my example. It is possible, however, that I do not correctly understand your description of the algorithm. — John Bollinger, Apr 17 '15 at 19:32
Step by step: A=[2,4,6,8,10,12,14,16,18,20] and B=[1,3,5,7,9,11,13,15,17,19]. Remove redundancy to get A=[2,4,6,8,10] and B=[9,11,13,15,17,19]. Remove redundancy again to get A=[6,8,10] and B=[9,11,13,15]. Remove redundancy again to get A=[8,10] and B=[9,11]. Exit recursion. After exiting the recursion, the correct result, 10, is still in our data. We didn't throw it away. — Oria Gruber, Apr 17 '15 at 19:36
Ok, I'm having trouble conceptualizing *why* the recursion step of the algorithm is right, but I acknowledge that it appears to be successfully narrowing down the median candidates to a set of 4 containing the correct median. I don't see how you could test each candidate in O(1), but you certainly could test each in O(log n), which would be sufficient. — John Bollinger, Apr 17 '15 at 19:47
The arrays are sorted. I could test in O(1) if x is larger than i elements in A and n-i elements in B, which would make it the median. — Oria Gruber, Apr 17 '15 at 19:52
Got it, and got it. I was thinking that you would need to search A and B for the insertion positions of candidates, but you're right, you know what that position *has to be* for a given candidate to be the overall median, so you don't need to search at all, just test. And the recursion works because at each step you throw away the same number of elements above the median as you do below it. (And I haven't checked your code, but that's a detail that is susceptible to off-by-one errors.) — John Bollinger, Apr 17 '15 at 20:02
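The constant-time test discussed in these comments could look roughly like the sketch below. It is an illustration under assumptions, not the asker's code: the name `is_union_median` is hypothetical, both arrays are assumed sorted, of equal length n, with all values distinct, and the median is taken to be the smaller of the two middle elements, so the candidate must have exactly n-1 smaller elements in the union.

```c
/* Returns 1 if a[i] is the lower median of the union of a and b, else 0.
 * a[i] already has i smaller elements in its own array, so exactly
 * j = n - 1 - i elements of b must be smaller than it. Because b is
 * sorted, that holds iff b[j-1] < a[i] < b[j]; no search is needed. */
int is_union_median(const int *a, const int *b, int n, int i)
{
    int j = n - 1 - i;               /* required count of smaller elements in b */
    if (j > 0 && b[j - 1] > a[i])
        return 0;                    /* fewer than j elements of b are smaller */
    if (j < n && b[j] < a[i])
        return 0;                    /* more than j elements of b are smaller */
    return 1;
}
```

To test an element of B instead, swap the array arguments. With A=[1,3,5,7] and B=[2,4,6,8], `is_union_median(A, B, 4, 1)` rejects 3, while `is_union_median(B, A, 4, 1)` accepts 4.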

score 0 · Answer 1 · answered Apr 17 '15 at 20:53

The algorithm needs to implement a binary search for the median, i.e. propose a possible value for the median. If that value is too low, then choose a higher value on the next iteration. If too high, then choose a lower value.

At each iteration, we choose a candidate from A, and choose a candidate from B. The smaller candidate is proposed as the median, and evaluated. If the proposed median is too small, then all smaller values from A and B can be removed from consideration. Likewise, if the proposed median is too large, then larger values from A and B can be ignored.

For example, given A=[1,2,7,19,22] the candidate from A would be 7. Assume that B proposes a larger candidate, so 7 is chosen as the possible median. If 7 is too low, then we can eliminate all elements <= 7 in both A and B as possible candidates. So A becomes A=[1,2,7,{19,22}] where the elements in curly braces are the remaining possible candidates for the median. The process is repeated, but this time the candidate from A would be 19.

To continue the example, let's say that B=[20,25,26,27]. The proposed candidate from B is 25. A's candidate is lower so we evaluate 19. List A has 3 values lower than 19, and 1 higher. List B has 4 values higher. Total 3 lower, 5 higher. Conclusion: 19 is too low, so eliminate as possible candidates all numbers <= 19. After two passes we have

A=[1,2,7,19,{22}]  B=[{20,25,26,27}]

A's candidate is 22, B's is 25, propose 22 as the median. 22 is too high so numbers >= 22 can be ignored and we have

A=[1,2,7,19,{},22]  // 19 was too low and 22 was too high, so no candidates are left in A
B=[{20},25,26,27]   // 22 was too high, so the only remaining candidate in B is 20

20 is the only remaining candidate in either list, and is therefore the answer.
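The "evaluate the proposed median" step used in this walkthrough can be sketched as follows. This is one possible reading rather than the answerer's implementation: the names `count_less` and `evaluate` are hypothetical, the arrays are assumed sorted with distinct values, and counting uses a binary search, so each evaluation costs O(log n) rather than O(1).

```c
/* Number of elements in sorted a[0..n-1] that are strictly less than x. */
int count_less(const int *a, int n, int x)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < x)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Classifies a proposed median x of the union of a and b:
 * -1 means too low, +1 means too high, 0 means x sits at the median rank.
 * The median rank is the position with (na + nb - 1) / 2 smaller elements,
 * which is the lower middle when the total count is even. */
int evaluate(const int *a, int na, const int *b, int nb, int x)
{
    int target = (na + nb - 1) / 2;
    int less = count_less(a, na, x) + count_less(b, nb, x);
    if (less < target)
        return -1;
    if (less > target)
        return 1;
    return 0;
}
```

For A=[1,2,7,19,22] and B=[20,25,26,27], `evaluate` returns -1 for 19, +1 for 22, and 0 for 20, which matches the three verdicts in the example above.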

Edward Doolittle · Answer 2 · 2015-04-17T21:41:05.760

Let me suggest a different way of conceptualizing this problem. Suppose there are 4 elements in each array. Consider this grid:

a1 a2 a3 a4
b1 b2 b3 b4

We are looking for a line through the center of the arrangement, which guarantees that the number of entries to the left of the line equals the number of entries to the right. Note also that a horizontal line can divide the entries in two different ways (smaller above or smaller below). So the number of lines we need to consider is 5 in this case, and n+1 in general. A binary search through the lines ought to do the trick.
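One way to read the binary search over dividing lines is as a search over how many elements of the first array fall in the smaller half. The sketch below follows that reading and is an assumption rather than the answerer's code: the name `median_by_partition` is hypothetical, and the arrays are assumed sorted, of equal length n >= 1, with all values distinct. There are n+1 possible splits (i = 0 through n), matching the n+1 lines mentioned above.

```c
/* Lower median of the union of two sorted arrays of equal length n.
 * i elements of a and j = n - i elements of b form the smaller half.
 * A split is valid when everything in the smaller half is below
 * everything in the larger half; the median is then the largest
 * element of the smaller half. Runs in O(log n). */
int median_by_partition(const int *a, const int *b, int n)
{
    int lo = 0, hi = n;
    while (lo <= hi) {
        int i = lo + (hi - lo) / 2;  /* elements taken from a */
        int j = n - i;               /* elements taken from b */
        if (i > 0 && j < n && a[i - 1] > b[j]) {
            hi = i - 1;              /* took too many from a */
        } else if (j > 0 && i < n && b[j - 1] > a[i]) {
            lo = i + 1;              /* took too few from a */
        } else {
            if (i == 0)
                return b[j - 1];     /* smaller half lies entirely in b */
            if (j == 0)
                return a[i - 1];     /* smaller half lies entirely in a */
            return a[i - 1] > b[j - 1] ? a[i - 1] : b[j - 1];
        }
    }
    return -1;                       /* not reached for valid input */
}
```

For A=[1,3,5,7] and B=[2,4,6,8] this returns 4, and for the even and odd integers up to 20 from the comments it returns 10.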
