How to find the rank of an element in a stream of numbers efficiently?

Question

Recently I'm trying to find the median of a stream of numbers with the following conditions:

3-pass algorithm
O(nlog(n)) time
O(sqrt(n)) space

The input is repeated 3 times, including n, the number of integers, followed by n integers a_i such that:

n is odd
1≤n≤10^7
|a_i| ≤ 2^{30}

The format of an input data is shown as follows:

My code so far is shown as follows:

#ifdef STREAMING_JUDGE
#include "io.h"
#define next_token io.next_token
#else
#include<string>
#include<iostream>
using namespace std; 
string next_token()
{
    string s;
    cin >> s;
    return s;
}
#endif

#include<cstdio>
#include<cstdlib>
#include<vector>
#include<algorithm>
#include<iostream>
#include<math.h>

using namespace std;

int main()
{
    srand(time(NULL));
    //1st pass: randomly choose sqrt(n) numbers from the given stream of numbers
    int n = atoi(next_token().c_str());
    int p = (int)ceil(sqrt(n));
    vector<int> a;
    for(int i=0; i<n; i++)
    {
        int s=atoi(next_token().c_str());
        if( rand()%p == 0 && (int)a.size() < p )
        {
            a.push_back(s);
        }
    }
    sort(a.begin(), a.end());
    //2nd pass: find the k such that the median lies in a[k] and a[k+1], and find the rank of the median between a[k] and a[k+1]
    next_token();
    vector<int> rank(a.size(),0);
    for( int j = 0; j < (int)a.size(); j++ )
    {
        rank.push_back(0);
    }
    for( int i = 0; i < n; i++ )
    {
        int s=atoi(next_token().c_str());
        for( int j = 0; j < (int)rank.size(); j++ )
        {
            if( s<=a[j] )
            {
                rank[j]++;
            }
        }
    }
    int median = 0;
    int middle = (n+1)/2;
    int k;
    if( (int)a.size() == 1 && rank.front() == middle )
    {
        median=a.front();
        cout << median << endl;
        return 0;
    }
    for( int j = 0; j < (int)rank.size(); j++ )
    {
        if( rank[j] == middle )
        {
            cout << rank[j] << endl;
            return 0;
        }
        else if( rank[j] < middle && rank[j+1] > middle )
        {
            k = j;
            break;
        }
    }
    //3rd pass: sort the numbers in (a[k], a[k+1]) to find the median
    next_token();
    vector<int> FinalRun;
    if( rank.empty() )
    {
        for(int i=0; i<n; i++)
        {
            a.push_back(atoi(next_token().c_str()));
        }
        sort(a.begin(), a.end());
        cout << a[n>>1] << endl;
        return 0;
    }
    else if( rank.front() > middle )
    {
        for( int i = 0; i < n; i++ )
        {
            int s = atoi(next_token().c_str());
            if( s < a.front() )  FinalRun.push_back(s);
        }
        sort( FinalRun.begin(), FinalRun.end() );
        cout << FinalRun[middle-1] << endl;
        return 0;
    }
    else if ( rank.back() < middle )
    {
        for( int i = 0; i < n; i++ )
        {
            int s = atoi(next_token().c_str());
            if( s > a.back() )  FinalRun.push_back(s);
        }
        sort( FinalRun.begin(), FinalRun.end() );
        cout << FinalRun[middle-rank.back()-1] << endl;
        return 0;
    }
    else
    {
        for( int i = 0; i < n; i++ )
        {
            int s = atoi(next_token().c_str());
            if( s > a[k] && s < a[k+1] )  FinalRun.push_back(s);
        }
        sort( FinalRun.begin(), FinalRun.end() );
        cout << FinalRun[middle-rank[k]-1] << endl;
        return 0;
    }
}

But I still cannot reach the O(nlogn) time complexity. I guess that the bottleneck is in the ranking part (i.e. finding the rank of the median in (a[k], a[k+1]) by finding the ranks of the sampled a[i]'s in the input stream of numbers.) in the 2nd pass. This part has O(nsqrt(n)) in my code.

But I have no idea about how to improve the efficiency of ranking...... Is there any suggestion for efficiency improvement? Thanks in advance!

Further explanation of "rank": the rank of a sampled number calculates the number of numbers in the stream less than or equal to the sampled number. For instance: In the input given as above, if the numbers a[0]=2, a[1]=4, and a[2]=5 are sampled, then rank[0]=2 because there are two numbers (1 and 2) in the stream less than or equal to a[0].

Thanks for all of your help. Especially @alexeykuzmin0 's suggestion can indeed speed up the 2nd pass to O(n*logn) time. But there is a remaining issue: In the 1st pass, I sample the numbers with the probability 1/sqrt(n). When there is no number sampled (the worst case), the vector a is empty, causing that the following passes cannot be executed (i.e., a segmentation fault (core dumped) occurs). @Aconcagua, what do do mean "select all remaining elements, if there aren't more than required any more"? Thanks.

I'm not quite sure what you're trying to do. If you just want to know which index an element has in the sorted list, why not just count the number of elements that compare less than the element? — Clearer, Apr 03 '18 at 09:45
@Clearer My code indeed calculates the rank of an element just as you said. Actually, in the 1st pass, I want to randomly choose sqrt(n) numbers from the given stream of numbers a[1],a[2],...,a[sqrt(n)]. Then in the 2nd pass I want to calculate the rank of the a[i]'s to find the index k such that a[k] — Vincent Tsai, Apr 03 '18 at 09:56
Could you use `std::nth_element`? Using that you can find the median and do a partial sorting of the elements. If `a[k] < median` must hold, then `k` must be one less than the index of the median; i.e. `k = floor(n / 2) - 1`. But if `median < a[k + 1]`, then `k` must be greater than the index of the median; this conflicts with `k` being one less than the same index. — Clearer, Apr 03 '18 at 10:11
Your algorithm to select `sqrt(n)` elements does not guarantee to do so, by the way; it just won't select *more* (if you are really unlucky, you could even select none at all...). To fix this, you need to select all remaining elements, if there aren't more than required any more... — Aconcagua, Apr 03 '18 at 10:51
`vector a( 0, n );` creates a vector with 0 values, all set to `n`; as resulting size == 0, equivalent to just `vector a;`; on the other hand: `vector rank;` + for loop is equivalent to simply `vector rank(a.size(), 0);` without loop... — Aconcagua, Apr 03 '18 at 11:02
`vector a; a.reserve(p);` to prevent (hidden) re-allocations within. — Aconcagua, Apr 03 '18 at 11:04
@alexeykuzmin0 Applied to vectors of size p, it is O(p) - with p == sqrt(n)... — Aconcagua, Apr 03 '18 at 11:37
Tip: if m is the median of the set {1,...,n}, then if you pack the numbers in sub ensembles {{1,2,3,...n_1},{n_1,....n_2},....{n_x...n}} then the subensemble that has m as a member is also the median of this super ensemble. — Oliv, Apr 03 '18 at 11:42
The `rank` vector must be one larger than the `a` vector: Sequence being `1, 2, 3, 4, 5`, i. e. `n == 5` -> `p == 3` and we happen so select 2, 3, 4, then 1 increases rank[0] and 5 should increment rank[p]... — Aconcagua, Apr 03 '18 at 11:55
@Aconcagua, what do do mean "select all remaining elements, if there aren't more than required any more"? Thanks. — Vincent Tsai, Apr 03 '18 at 19:09

score 2 · Accepted Answer · answered Apr 03 '18 at 11:44

2

You right, your second part works in O(n√n) time:

for( int i = 0; i < n; i++ )                    // <= n iterations
  ...
    for( int j = 0; j < (int)rank.size(); j++ ) // <= √n iterations

To fix this, we need to get rid of the inner loop. For example, instead of directly calculating amount of elements of initial array that are less than a threshold, we could first calculate amount of elements of the array falling into each interval:

// Same as in your code
for (int i = 0; i < n; ++i) {
    int s = atoi(next_token().c_str());
    // Find index of interval in O(log n) time
    int idx = std::upper_bound(a.begin(), a.end(), s) - a.begin();
    // Increase the rank of only that interval
    ++rank[idx];
}

And then calculate ranks of your threshold elements:

std::partial_sum(rank.begin(), rank.end(), rank.begin());

The resulting complexity is O(n log n) + O(n) = O(n log n).

Here I used two STL algorithms:

std::upper_bound which finds a first element in a sorted array which is strictly greater than given number in logarithmic time, using binary search method.
std::partial_sum which calculates partial sums of an array given in a linear time.

answered Apr 03 '18 at 11:44

alexeykuzmin0

6,344
2
28
51

Thanks for your answer. But the results are different from what I expected......Actually, the rank of a sampled number calculates the number of numbers in the stream less than or equal to the sampled number. For instance: In the input given as above, if the numbers a[0]=2, a[1]=4, and a[2]=5 are sampled, then rank[0]=2 because there are two numbers (1 and 2) in the stream less than or equal to a[0]. – Vincent Tsai Apr 03 '18 at 16:37
This code was to illustrate the idea, I didn't really check it. It may contain some +-1 bugs. Feel free to fix if you find any. – alexeykuzmin0 Apr 03 '18 at 16:39
Thanks for all of your help. Especially @alexeykuzmin0 's suggestion can indeed speed up the 2nd pass to O(n*logn) time. But there is a remaining issue: In the 1st pass, I sample the numbers with the probability 1/sqrt(n). When there is no number sampled (the worst case), the vector a is empty, causing that the following passes cannot be executed (i.e., a segmentation fault (core dumped) occurs). – Vincent Tsai Apr 03 '18 at 19:15
1

@VincentTsai Track the number you have observed so far (n_o) and the number you have sampled so far (n_s). Then sample each observation with probability `(sqrt(n) - n_s) / (n - n_o)`. This yields an overall probability of 1/sqrt(n), but in such a way that if sampling is initially sparse, the rate increases (eventually up to probability 1), vs. if you sample early the rate drops to zero as you approach a sample size of sqrt(n). – pjs Apr 03 '18 at 20:51
@pjs Thanks for your suggestion. This indeed helps the 1st pass work more smoothly – Vincent Tsai Apr 04 '18 at 14:53

How to find the rank of an element in a stream of numbers efficiently?

1 Answers1