33

I have an array filled with integers. My job is to find the majority element quickly for any part of the array, and I need to do it in O(log n) time, not linear, but beforehand I can take some time to prepare the array.

For example:

1 5 2 7 7 7 8 4 6

And queries:

[4, 7] returns 7

[4, 8] returns 7

[1, 2] returns 0 (no majority element), and so on...

I need to have an answer for each query, and if possible, it needs to execute fast.

For preparation, I can use O(n log n) time

Krzysztofik
    To clarify: query range `4-7` gives `7`, query range `4-8` gives `7`, query range `1-2` gives `0` (none). Is this correct? – Felix Glas Nov 03 '13 at 15:16
  • C++ 11 version i think, GCC 4.6.3 – Krzysztofik Nov 03 '13 at 15:17
  • Yes, that's exactly how it works; sorry for the form, I thought it would be more readable ;) – Krzysztofik Nov 03 '13 at 15:17
  • If it helps, I thought about something tree-related, because of the required time. – Krzysztofik Nov 03 '13 at 15:22
  • What are the constraints on the values in the array and the size of the array? – Armen Tsirunyan Nov 03 '13 at 15:29
  • They are both lower than 1 000 000. – Krzysztofik Nov 03 '13 at 15:33
  • @user2950055 And `value > 0` as you use `0` as "no majority element". – Felix Glas Nov 03 '13 at 15:36
  • Yes, of course. It's only a way to mark "no majority element" – Krzysztofik Nov 03 '13 at 15:39
  • What do you mean by "prepare the array"? How much time can you take? You could just "prepare" a 2D table containing all the answers, and then look them up in `O(1)` – Rob Nov 03 '13 at 15:41
  • @Rob, Good luck with million by million table. – chill Nov 03 '13 at 15:48
  • @Chill - should fit easily enough in 2 TB :-) – Rob Nov 03 '13 at 15:56
  • The complexity of just "finding the mode of an array" in the general case is O(n log n). You should allow at least O(n log n) for preparation. – Maxim Razin Nov 03 '13 at 16:09
  • @grep I wanted to avoid it, but... ok, it could be n log n – Krzysztofik Nov 03 '13 at 16:13
  • Are the values unsigned ? – Michael M. Nov 03 '13 at 16:32
  • @Michael Yes, 1-1000000 – Krzysztofik Nov 03 '13 at 16:35
  • (a) Do you want the most frequently occurring element, or some relaxed majority, e.g. more than half occurrences? (b) The cartesian tree response to [finding minimum in a range](http://stackoverflow.com/questions/19756489/lowest-value-in-range) is very clever, but even that seems insufficient to pull off maximal occurrences. This suggests that O(lg n) may be too stringent a requirement for finding the most frequent element, since `majority` requires more information than `min`: `min` can be computed in O(n) for an unordered list of elements, but `majority` needs O(n) on a sorted list. – Senti Bachcha Nov 04 '13 at 01:51
  • It may be useful to know that the type of query you're interested in is called a Range Query. If you have a list and you wish to find the minimum over a range, it's called Range Minimum Query. If you wish to find the median over a range, Range Median Query. In this case: Range Mode Query. With that said, you'll find more information here: http://en.wikipedia.org/wiki/Range_Queries#Mode – bdean20 Nov 07 '13 at 06:46
  • What are the constraints on the number of different values? Should we assume that it's linear with n? – Beta Nov 07 '13 at 06:50
  • @bdean20: But the Wiki doesn't mention the algorithm, and I don't see any O(n log n) preparation with O(log n) per query algorithm. And we're not looking for the mode, but the majority. Majority element needs to be more than half of the number of elements in the range. – justhalf Nov 07 '13 at 09:06
  • I think you can make this work by applying divide and conquer in the preparation and applying http://stackoverflow.com/questions/4325200/find-majority-element-in-array/9487018#9487018 for each part of the array, storing the results in a binary tree. But @justhalf rightly points out holes in my reasoning. Will ponder on it some more, withdrawing the answer. – flup Nov 07 '13 at 09:32
  • I think I can do it in O((log n)^2), is it good enough? – n. m. could be an AI Nov 07 '13 at 10:06
  • This should be rather easy in O(m log n), where m is the length of the range. Is this O(log n) a very strong requirement? – Griwes Nov 07 '13 at 10:48
  • Could you specify exact difference between preparations and "the job" time? Why can't you find the majority element during preparations and then just extract it from memory? – klm123 Nov 07 '13 at 13:43
  • This is replying to a 10-year old comment, but here it goes: @klm123 because in "the job" there will be multiple queries. Each should take O(log n) time to answer. – justhalf Jun 19 '23 at 17:17

6 Answers

16

O(log n) queries and O(n log n) preprocessing/space could be achieved by finding and using majority intervals with the following properties:

  1. For each value from the input array there may be one or several majority intervals (or there may be none if elements with this value are too sparse; we don't need majority intervals of length 1 because they can be useful only for query intervals of size 1, which are better handled as a special case).
  2. If a query interval lies completely inside one of these majority intervals, the corresponding value may be the majority element of this query interval.
  3. If there is no majority interval completely containing the query interval, the corresponding value cannot be the majority element of this query interval.
  4. Each element of the input array is covered by O(log n) majority intervals.

In other words, the only purpose of majority intervals is to provide O(log n) majority element candidates for any query interval.

This algorithm uses the following data structures:

  1. A list of positions for each value from the input array (map<Value, vector<Position>>). Alternatively, an unordered_map may be used here to improve performance (but then we'll need to extract all keys and sort them so that structure #3 is filled in the proper order).
  2. A list of majority intervals for each value (vector<Interval>).
  3. A data structure for handling queries (vector<small_map<Value, Data>>), where Data contains two indexes into the appropriate vector from structure #1, pointing to the next/previous positions of elements with the given value. Update: thanks to @justhalf, it is better to store in Data the cumulative frequencies of elements with the given value. small_map may be implemented as a sorted vector of pairs - preprocessing will append elements already in sorted order and a query will use small_map only for linear search.

Preprocessing:

  1. Scan the input array and push the current position to the appropriate vector in structure #1.
  2. Perform steps 3 .. 4 for every vector in structure #1.
  3. Transform the list of positions into a list of majority intervals. See details below.
  4. For each index of the input array covered by one of the majority intervals, insert data into the appropriate element of structure #3: the value and the positions of the previous/next elements with this value (or the cumulative frequency of this value).

Query:

  1. If the query interval length is 1, return the corresponding element of the source array.
  2. For the starting point of the query interval, get the corresponding element of the 3rd structure's vector. For each element of that map, perform step 3. Scan all elements of the map corresponding to the ending point of the query interval in parallel with this map, to allow O(1) complexity for step 3 (instead of O(log log n)).
  3. If the map corresponding to the ending point of the query interval contains a matching value, compute s3[stop][value].prev - s3[start][value].next + 1. If it is greater than half of the query interval length, return the value. If cumulative frequencies are used instead of next/previous indexes, compute s3[stop+1][value].freq - s3[start][value].freq instead.
  4. If nothing is found in step 3, return "Nothing".

The main part of the algorithm is getting majority intervals from the list of positions:

  1. Assign weight to each position in the list: number_of_matching_values_to_the_left - number_of_nonmatching_values_to_the_left.
  2. Filter only weights in strictly decreasing order (greedily) to the "prefix" array: for (auto x: positions) if (x < prefix.back()) prefix.push_back(x);.
  3. Filter only weights in strictly increasing order (greedily, backwards) to the "suffix" array: reverse(positions); for (auto x: positions) if (x > suffix.back()) suffix.push_back(x);.
  4. Scan "prefix" and "suffix" arrays together and find intervals from every "prefix" element to corresponding place in "suffix" array and from every "suffix" element to corresponding place in "prefix" array. (If all "suffix" elements' weights are less than given "prefix" element or their position is not to the right of it, no interval generated; if there is no "suffix" element with exactly the weight of given "prefix" element, get nearest "suffix" element with larger weight and extend interval with this weight difference to the right).
  5. Merge overlapping intervals.

Properties 1 .. 3 for majority intervals are guaranteed by this algorithm. As for property #4, the only way I could imagine to cover some element with maximum number of majority intervals is like this: 11111111222233455666677777777. Here element 4 is covered by 2 * log n intervals, so this property seems to be satisfied. See more formal proof of this property at the end of this post.

Example:

For input array "0 1 2 0 0 1 1 0" the following lists of positions would be generated:

value  positions
    0  0 3 4 7
    1  1 5 6
    2  2

Positions for value 0 will get the following properties:

weights:    0:1 3:0 4:1 7:0
prefix:     0:1 3:0          (strictly decreasing)
suffix:     4:1 7:0          (strictly increasing when scanning backwards)
intervals:  0->4 3->7 4->0 7->3
merged intervals: 0-7

Positions for value 1 will get the following properties:

weights:    1:0  5:-2  6:-1
prefix:     1:0  5:-2
suffix:     1:0  6:-1
intervals:  1->none 5->6+1 6->5-1 1->none
merged intervals: 4-7

Query data structure:

positions value next prev
        0     0    0    x
     1..2     0    1    0
        3     0    1    1
        4     0    2    2
        4     1    1    x
        5     0    3    2
    ...

Query [0,4]:

prev[4][0]-next[0][0]+1=2-0+1=3
query size=5
3>2.5, returned result 0

Query [2,5]:

prev[5][0]-next[2][0]+1=2-1+1=2
query size=4
2=2, returned result "none"

Note that there is no attempt to inspect element "1" because its majority interval does not include either of these intervals.

Proof of property #4:

Majority intervals are constructed in such a way that strictly more than 1/3 of all their elements have corresponding value. This ratio is nearest to 1/3 for sub-arrays like any*(m-1) value*m any*m, for example, 01234444456789.

To make this proof more obvious, we could represent each interval as a point in 2D: every possible starting point represented by horizontal axis and every possible ending point represented by vertical axis (see diagram below).

[Diagram: intervals plotted as points in 2D, with the starting point on the horizontal axis and the ending point on the vertical axis; the white rectangle and yellow area referenced below are marked on it.]

All valid intervals are located on or above the diagonal. The white rectangle represents all intervals covering some array element (that element is represented as the unit-size interval at its lower right corner).

Let's cover this white rectangle with squares of size 1, 2, 4, 8, 16, ... sharing the same lower right corner. This divides the white area into O(log n) areas similar to the yellow one (plus a single square of size 1 containing a single interval of size 1, which is ignored by this algorithm).

Let's count how many majority intervals may be placed into the yellow area. One interval (located at the corner nearest to the diagonal) occupies 1/4 of the elements belonging to the interval at the corner farthest from the diagonal (and this largest interval contains all elements belonging to any interval in the yellow area). Since every majority interval has strictly more than 1/3 of its elements matching its value, this means that the smallest interval contains strictly more than 1/12 of the values available for the whole yellow area. So if we try to place 12 intervals (for 12 different values) into the yellow area, we don't have enough elements. So the yellow area cannot contain more than 11 majority intervals, and the white rectangle cannot contain more than 11 * log n majority intervals. Proof completed.

11 * log n is an overestimation. As I said earlier, it's hard to imagine more than 2 * log n majority intervals covering some element. And even this value is much greater than the average number of covering majority intervals.

C++11 implementation. See it either at ideone or here:

#include <iostream>
#include <vector>
#include <map>
#include <algorithm>
#include <functional>
#include <random>

constexpr int SrcSize = 1000000;
constexpr int NQueries = 100000;

using src_vec_t = std::vector<int>;
using index_vec_t = std::vector<int>;
using weight_vec_t = std::vector<int>;
using pair_vec_t = std::vector<std::pair<int, int>>;
using index_map_t = std::map<int, index_vec_t>;
using interval_t = std::pair<int, int>;
using interval_vec_t = std::vector<interval_t>;
using small_map_t = std::vector<std::pair<int, int>>;
using query_vec_t = std::vector<small_map_t>;

constexpr int None = -1;
constexpr int Junk = -2;

src_vec_t generate_e()
{ // good query length = 3
    src_vec_t src;
    std::random_device rd;
    std::default_random_engine eng{rd()};
    auto exp = std::bind(std::exponential_distribution<>{0.4}, eng);

    for (int i = 0; i < SrcSize; ++i)
    {
        int x = exp();
        src.push_back(x);
        //std::cout << x << ' ';
    }

    return src;
}

src_vec_t generate_ep()
{ // good query length = 500
    src_vec_t src;
    std::random_device rd;
    std::default_random_engine eng{rd()};
    auto exp = std::bind(std::exponential_distribution<>{0.4}, eng);
    auto poisson = std::bind(std::poisson_distribution<int>{100}, eng);

    while (int(src.size()) < SrcSize)
    {
        int x = exp();
        int n = poisson();

        for (int i = 0; i < n; ++i)
        {
            src.push_back(x);
            //std::cout << x << ' ';
        }
    }

    return src;
}

src_vec_t generate()
{
    //return generate_e();
    return generate_ep();
}

int trivial(const src_vec_t& src, interval_t qi)
{
    int count = 0;
    int majorityElement = 0; // will be assigned before use for valid args

    for (int i = qi.first; i <= qi.second; ++i)
    {
        if (count == 0)
            majorityElement = src[i];

        if (src[i] == majorityElement) 
           ++count;
        else 
           --count;
    }

    count = 0;
    for (int i = qi.first; i <= qi.second; ++i)
    {
        if (src[i] == majorityElement)
            count++;
    }

    if (2 * count > qi.second + 1 - qi.first)
        return majorityElement;
    else
        return None;
}

index_map_t sort_ind(const src_vec_t& src)
{
    int ind = 0;
    index_map_t im;

    for (auto x: src)
        im[x].push_back(ind++);

    return im;
}

weight_vec_t get_weights(const index_vec_t& indexes)
{
    weight_vec_t weights;

    for (int i = 0; i != int(indexes.size()); ++i)
        weights.push_back(2 * i - indexes[i]);

    return weights;
}

pair_vec_t get_prefix(const index_vec_t& indexes, const weight_vec_t& weights)
{
    pair_vec_t prefix;

    for (int i = 0; i != int(indexes.size()); ++i)
        if (prefix.empty() || weights[i] < prefix.back().second)
            prefix.emplace_back(indexes[i], weights[i]);

    return prefix;
}

pair_vec_t get_suffix(const index_vec_t& indexes, const weight_vec_t& weights)
{
    pair_vec_t suffix;

    for (int i = indexes.size() - 1; i >= 0; --i)
        if (suffix.empty() || weights[i] > suffix.back().second)
            suffix.emplace_back(indexes[i], weights[i]);

    std::reverse(suffix.begin(), suffix.end());
    return suffix;
}

interval_vec_t get_intervals(const pair_vec_t& prefix, const pair_vec_t& suffix)
{
    interval_vec_t intervals;
    int prev_suffix_index = 0; // will be assigned before use for correct args
    int prev_suffix_weight = 0; // same assumptions

    for (int ind_pref = 0, ind_suff = 0; ind_pref != int(prefix.size());)
    {
        auto i_pref = prefix[ind_pref].first;
        auto w_pref = prefix[ind_pref].second;

        if (ind_suff != int(suffix.size()))
        {
            auto i_suff = suffix[ind_suff].first;
            auto w_suff = suffix[ind_suff].second;

            if (w_pref <= w_suff)
            {
                auto beg = std::max(0, i_pref + w_pref - w_suff);

                if (i_pref < i_suff)
                    intervals.emplace_back(beg, i_suff + 1);

                if (w_pref == w_suff)
                    ++ind_pref;

                ++ind_suff;
                prev_suffix_index = i_suff;
                prev_suffix_weight = w_suff;
                continue;
            }
        }

        // ind_suff out of bounds or w_pref > w_suff:
        auto end = prev_suffix_index + prev_suffix_weight - w_pref + 1;
        // end may be out-of-bounds; that's OK if overflow is not possible
        intervals.emplace_back(i_pref, end);
        ++ind_pref;
    }

    return intervals;
}

interval_vec_t merge(const interval_vec_t& from)
{
    using endpoints_t = std::vector<std::pair<int, bool>>;
    endpoints_t ep(2 * from.size());

    std::transform(from.begin(), from.end(), ep.begin(),
                   [](interval_t x){ return std::make_pair(x.first, true); });

    std::transform(from.begin(), from.end(), ep.begin() + from.size(),
                   [](interval_t x){ return std::make_pair(x.second, false); });

    std::sort(ep.begin(), ep.end());

    interval_vec_t to;
    int start; // will be assigned before use for correct args
    int overlaps = 0;

    for (auto& x: ep)
    {
        if (x.second) // begin
        {
            if (overlaps++ == 0)
                start = x.first;
        }
        else // end
        {
            if (--overlaps == 0)
                to.emplace_back(start, x.first);
        }
    }

    return to;
}

interval_vec_t get_intervals(const index_vec_t& indexes)
{
    auto weights = get_weights(indexes);
    auto prefix = get_prefix(indexes, weights);
    auto suffix = get_suffix(indexes, weights);
    auto intervals = get_intervals(prefix, suffix);
    return merge(intervals);
}

void update_qv(
    query_vec_t& qv,
    int value,
    const interval_vec_t& intervals,
    const index_vec_t& iv)
{
    int iv_ind = 0;
    int qv_ind = 0;
    int accum = 0;

    for (auto& interval: intervals)
    {
        int i_begin = interval.first;
        int i_end = std::min<int>(interval.second, qv.size() - 1);

        while (iv[iv_ind] < i_begin)
        {
            ++accum;
            ++iv_ind;
        }

        qv_ind = std::max(qv_ind, i_begin);

        while (qv_ind <= i_end)
        {
            qv[qv_ind].emplace_back(value, accum);

            if (iv[iv_ind] == qv_ind)
            {
                ++accum;
                ++iv_ind;
            }

            ++qv_ind;
        }
    }
}

void print_preprocess_stat(const index_map_t& im, const query_vec_t& qv)
{
    double sum_coverage = 0.;
    int max_coverage = 0;

    for (auto& x: qv)
    {
        sum_coverage += x.size();
        max_coverage = std::max<int>(max_coverage, x.size());
    }

    std::cout << "             size = " << qv.size() - 1 << '\n';
    std::cout << "           values = " << im.size() << '\n';
    std::cout << "     max coverage = " << max_coverage << '\n';
    std::cout << "     avg coverage = " << sum_coverage / qv.size() << '\n';
}

query_vec_t preprocess(const src_vec_t& src)
{
    query_vec_t qv(src.size() + 1);
    auto im = sort_ind(src);

    for (auto& val: im)
    {
        auto intervals = get_intervals(val.second);
        update_qv(qv, val.first, intervals, val.second);
    }

    print_preprocess_stat(im, qv);
    return qv;
}

int do_query(const src_vec_t& src, const query_vec_t& qv, interval_t qi)
{
    if (qi.first == qi.second)
        return src[qi.first];

    auto b = qv[qi.first].begin();
    auto e = qv[qi.second + 1].begin();

    while (b != qv[qi.first].end() && e != qv[qi.second + 1].end())
    {
        if (b->first < e->first)
        {
            ++b;
        }
        else if (e->first < b->first)
        {
            ++e;
        }
        else // if (e->first == b->first)
        {
            // hope this doesn't overflow
            if (2 * (e->second - b->second) > qi.second + 1 - qi.first)
                return b->first;

            ++b;
            ++e;
        }
    }

    return None;
}

int main()
{
    std::random_device rd;
    std::default_random_engine eng{rd()};
    auto poisson = std::bind(std::poisson_distribution<int>{500}, eng);
    int majority = 0;
    int nonzero = 0;
    int failed = 0;

    auto src = generate();
    auto qv = preprocess(src);

    for (int i = 0; i < NQueries; ++i)
    {
        int size = poisson();
        auto ud = std::uniform_int_distribution<int>(0, src.size() - size - 1);
        int start = ud(eng);
        int stop = start + size;
        auto res1 = do_query(src, qv, {start, stop});
        auto res2 = trivial(src, {start, stop});
        //std::cout << size << ": " << res1 << ' ' << res2 << '\n';

        if (res2 != res1)
            ++failed;

        if (res2 != None)
        {
            ++majority;

            if (res2 != 0)
                ++nonzero;
        }
    }

    std::cout << "majority elements = " << 100. * majority / NQueries << "%\n";
    std::cout << " nonzero elements = " << 100. * nonzero / NQueries << "%\n";
    std::cout << "          queries = " << NQueries << '\n';
    std::cout << "           failed = " << failed << '\n';

    return 0;
}

Related work:

As pointed out in another answer to this question, there is other work where this problem is already solved: "Range majority in constant time and linear space" by S. Durocher, M. He, I. Munro, P.K. Nicholson, M. Skala.

The algorithm presented in that paper has better asymptotic complexity for query time (O(1) instead of O(log n)) and for space (O(n) instead of O(n log n)).

The better space complexity allows that algorithm to process larger data sets (compared to the algorithm proposed in this answer). Less memory needed for preprocessed data and a more regular data access pattern most likely also allow it to preprocess data more quickly. But it is not so clear-cut with query time...

Let's suppose we have input data most favorable to the algorithm from the paper: n = 1,000,000,000 (it's hard to imagine a system with more than 10..30 gigabytes of memory in 2013).

The algorithm proposed in this answer needs to process up to 120 elements (2 query boundaries * 2 * log n) for each query. But it performs very simple operations, similar to linear search, and it sequentially accesses two contiguous memory areas, so it is cache-friendly.

The algorithm from the paper needs to perform up to 20 operations (2 query boundaries * 5 candidates * 2 wavelet tree levels) for each query. This is 6 times fewer, but each operation is more complex: each query into the succinct representation of bit counters itself contains a linear search (which means 20 linear searches instead of one). Worst of all, each such operation has to access several independent memory areas (unless the query size, and therefore the quadruple size, is very small), so the query is cache-unfriendly. This means each query (while a constant-time operation) is pretty slow, probably slower than in the algorithm proposed here. The smaller the input array, the better the chances that the algorithm proposed here is quicker.

A practical disadvantage of the algorithm in the paper is the wavelet tree and succinct bit counter implementation. Implementing them from scratch can be pretty time consuming, and using a pre-existing implementation is not always convenient.

Evgeny Kluev
  • I can't seem to get this algorithm, especially the part on transforming the list of positions into list of majority intervals (I don't even get what a majority interval is). By "number of matching values" you meant "count of elements with the same value"? "weights in decreasing order" -> is it strictly decreasing? I can't understand your step 4 in *Main part*. Probably can you give one example run on "0 1 2 0 0 1 1 0" with the query [0,4] and [2,5]? – justhalf Nov 08 '13 at 01:37
  • @justhalf: Added more explanations and an example. Hope now it's easier to understand. Your guess is correct in both cases: `you meant "count of elements with the same value"?` and `is it strictly decreasing?`; both answers are "yes". – Evgeny Kluev Nov 08 '13 at 08:10
  • Your third data structure seems to be used only to get the count of each value in a specific interval. In that case, why don't just describe it as array of cumulative frequency? Then we can calculate the frequency at interval [a,b] by calculating cumulative[b][0]-cumulative[a][0]. More intuitive than using "next" and "prev" (which I still don't understand completely). So, your idea lies mainly in the property 4, right? Then I think we should try to prove it, so that we can be sure about the time complexity. =) – justhalf Nov 08 '13 at 08:40
  • @justhalf: I agree, it might be described (almost) as array of cumulative frequencies. Instead of next/prev this will require a single frequency value and an additional bit needed to adjust the result: if current entry's value is equal to source array's value (or just compare them at query time). As for property #4, for now I have nothing better than pseudo-proof in the answer. Of course I would add the proof **if** I manage to get it. Or maybe we'll get the help from SO community... – Evgeny Kluev Nov 08 '13 at 09:03
  • Eh, wait. If you use array of cumulative frequencies, there will be `O(n^2)` entries to be filled, right? – justhalf Nov 08 '13 at 15:50
  • @justhalf: No. Exactly the same amount as for prev/next. O(n log n). – Evgeny Kluev Nov 08 '13 at 15:53
  • Hmm, the `O(n log n)` also comes from the fact that there can only be `O(n log n)` intervals covering an element, is it? However I still can't get your algorithm fully, sorry =( But I get the general idea. Perhaps it's essentially similar to the paper pointed out by flup. The paper claimed to answer it in constant time (actually it's log of alphabet size, which I consider as `O(log n)`). I'm torn between giving the bounty to this answer, or the answer by flup, since it's simpler. – justhalf Nov 11 '13 at 02:11
  • @justhalf: Actually only `O(log n)` intervals cover an element, but there are `n` elements, which gives `O(n log n)` space/preprocessing. I agree, my skills of explaining the algorithm are not so good (comparing to quality of the paper). To compensate this I provided C++ implementation. As for advantages of algorithm in the paper, most of all I like smaller space complexity and simpler proofs. I don't think this algorithm would be simpler to implement because of advanced data structures (wavelet tree and something similar to segment tree). – Evgeny Kluev Nov 11 '13 at 12:27
  • Ah, yes, sorry, I meant `O(log n)` per element, typo. You've given great answer, though. =) I'll give the bounty to you tomorrow if no other great answers are present. – justhalf Nov 12 '13 at 01:49
  • Thank you flup for the insight. Definitely the one in the paper would be the best answer to this question. Evgeny had shown also his research part by going through the painstaking process of proving his algorithm, and he also provided a C++ implementation of his idea. So I guess I'll give the bounty to this answer. Although I believe there could be better implementation to your "`O(n)` worst case" frequency counting, your answer is also good, flup! – justhalf Nov 13 '13 at 01:43
  • I think your discussion of the algorithm by Durocher et al. is overly pessimistic. First, the quadruple blocks will usually have fewer than 5 candidates. (Remark 1 in the paper) Second, since the other elements are irrelevant, the alphabets of the quadruple block's wavelet trees can be reduced to the number of candidates for that block plus one. So the wavelet tree's alphabet size is not sigma, which is 20 or 60 or 1000000, but sigma' = #candidates+1 <= 6. They really mean it when they say O(1). Third, there is a c++ library available with an implementation of several flavours of WaveletTrees. – flup Nov 13 '13 at 01:44
  • This algorithm also uses a hashmap and suffers from the same worst case performance when verifying the candidates I think? Or am I misunderstanding? – flup Nov 13 '13 at 01:45
  • @flup: Right, I overestimated wavelet trees depth, now it is corrected. Thanks for pointing that. And thanks for finding this excellent paper. As for `fewer than 5 candidates`, here I compare worst-case time for both algorithms; my algorithm also usually gives fewer than `log n` candidates. Only measurements could say which algorithm does queries faster. I only give some arguments why I think that my algorithm has real chances here. Hashmap is only an optional improvement. Even if it is used, candidate verification does not depend on it (it depends only on a pair of short sorted arrays). – Evgeny Kluev Nov 13 '13 at 10:32
9

the trick

When looking for a majority element, you may discard intervals that do not have a majority element. See Find the majority element in array. This allows you to solve this quite simply.
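
For reference, a minimal sketch of the linked trick (Boyer-Moore majority vote) applied to a single interval; the function name and the use of -1 as the "no majority" marker are my own choices here, and this is essentially the same check as the trivial() function in Evgeny Kluev's answer:

#include <vector>

// Boyer-Moore majority vote over src[lo..hi] (inclusive).
// Returns the majority element, or -1 if there is none.
int majority_in_interval(const std::vector<int>& src, int lo, int hi)
{
    int candidate = -1;
    int count = 0;

    // Voting pass: pairs of different elements cancel each other out.
    for (int i = lo; i <= hi; ++i)
    {
        if (count == 0)
            candidate = src[i];

        count += (src[i] == candidate) ? 1 : -1;
    }

    // Verification pass: the surviving candidate may still not be a majority.
    int occurrences = 0;
    for (int i = lo; i <= hi; ++i)
        if (src[i] == candidate)
            ++occurrences;

    return (2 * occurrences > hi - lo + 1) ? candidate : -1;
}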

preparation

At preparation time, recursively keep dividing the array into two halves and store these array intervals in a binary tree. For each node, count the occurrence of each element in the array interval. You need a data structure that offers O(1) inserts and reads. I suggest using an unordered_multiset, which on average behaves as needed (but worst case inserts are linear). Also check if the interval has a majority element and store it if it does.

runtime

At runtime, when asked to compute the majority element for a range, dive into the tree to compute the set of intervals that covers the given range exactly. Use the trick to combine these intervals.

If we have array interval 7 5 5 7 7 7, with majority element 7, we can split off and discard 5 5 7 7 since it has no majority element. Effectively the fives have gobbled up two of the sevens. What's left is an array 7 7, or 2x7. Call this number 2 the majority count of the majority element 7:

The majority count of a majority element of an array interval is the occurrence count of the majority element minus the combined occurrence count of all other elements.

Use the following rules to combine intervals to find the potential majority element:

  • Discard the intervals that have no majority element
  • Combining two arrays with the same majority element is easy, just add up the element's majority counts. 2x7 and 3x7 become 5x7
  • When combining two arrays with different majority elements, the higher majority count wins. Subtract the lower majority count from the higher to find the resulting majority count. 3x7 and 2x3 become 1x7.
  • If their majority elements are different but have equal majority counts, disregard both arrays. 3x7 and 3x5 cancel each other out.

When all intervals have been either discarded or combined, you are either left with nothing, in which case there is no majority element, or you have one combined interval containing a potential majority element. Look up and add this element's occurrence counts in all array intervals (including the previously discarded ones) to check if it really is the majority element.
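
A possible sketch of the combine step these rules describe, assuming each tree node already stores a (majority element, majority count) pair, with count 0 meaning "no majority element" (names are illustrative):

#include <utility>

// (element, majority count); count == 0 means "no majority element".
using Node = std::pair<int, int>;

// Combine two adjacent array intervals according to the rules above.
Node combine(Node left, Node right)
{
    if (left.second == 0)            // discard an interval without a majority element
        return right;
    if (right.second == 0)
        return left;
    if (left.first == right.first)   // same majority element: add up the counts
        return {left.first, left.second + right.second};
    if (left.second == right.second) // equal counts of different elements cancel out
        return {0, 0};
    // different elements: the higher majority count wins, counts are subtracted
    return left.second > right.second
        ? Node{left.first, left.second - right.second}
        : Node{right.first, right.second - left.second};
}

The element in the resulting pair is only a candidate; as described above, it still has to be verified against the per-node occurrence counts.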

example

For the array 1,1,1,2,2,3,3,2,2,2,3,2,2, you get the tree (majority count x majority element listed in brackets)

                        1,1,1,2,2,3,3,2,2,2,3,2,2    
                                  (1x2)
                      /                           \
             1,1,1,2,2,3,3                       2,2,2,3,2,2
                                                    (4x2)
            /              \                   /            \
        1,1,1,2           2,3,3            2,2,2             3,2,2
         (2x1)            (1x3)            (3x2)             (1x2)
        /     \          /    \            /    \            /    \
     1,1      1,2       2,3     3        2,2     2        3,2      2
    (1x1)                     (1x3)     (2x2)  (1x2)             (1x2)
    /   \     /  \     /   \            /  \             /   \
   1     1   1   2    2    3           2    2           3     2
(1x1) (1x1)(1x1)(1x2)(1x2)(1x3)       (1x2)(1x2)       (1x3) (1x2)     

Range [5,10] (1-indexed) is covered by the set of intervals 2,3,3 (1x3), 2,2,2 (3x2). They have different majority elements. Subtract their majority counts, you're left with 2x2. So 2 is the potential majority element. Lookup and sum the actual occurrence counts of 2 in the arrays: 1+3 = 4 out of 6. 2 is the majority element.

Range [1,10] is covered by the set of intervals 1,1,1,2,2,3,3 (no majority element) and 2,2,2 (3x2). Disregard the first interval since it has no majority element, so 2 is the potential majority element. Sum the occurrence counts of 2 in all intervals: 2+3 = 5 out of 10. There is no majority element.

flup
  • @justhalf Back from the drawing board, I think this is it! – flup Nov 07 '13 at 23:48
  • I believe so =D. Let's see whether there is any further objections on your answer =) – justhalf Nov 08 '13 at 01:19
  • I'm sorry, but I think this is incorrect for the case "1,5,7,7,7,7,8,4,6" (your sample array with modification at position 3) with the query [3,9] (1-based). For position [3,5], it will get three occurrences of "7". But for [6,9], it will return nothing, and as you don't store the frequency of "7" in the range [6,9], you can't deduce that "7" is actually the majority element without scanning through the array. =( – justhalf Nov 08 '13 at 06:36
  • @justhalf Ha! I've heavily edited because it was simpler still! No need to store 2 candidates per array, saving majority elements only suffices. – flup Nov 08 '13 at 23:53
  • Can you describe how to do `For each node, count the occurrence of each element in the array interval` in `O(n log n)`? That's the crucial part. =) – justhalf Nov 11 '13 at 02:02
  • You can iterate over the array, and for each element dive into the tree, updating this element's counter for each node you encounter. That is O(n log n) updates. You need a datastructure that allows O(1) updates and reads. The naive choices are a 2D array or an unordered_multiset. The time complexity of `calloc`ing an array of size k*n is OS-dependent and might actually be o(1). An unordered_multiset is o(1) to instantiate and offers _on average_ o(1) update and read (but worst case is linear). A fancier datastructure like the wavelet tree would give you more reliable performance. – flup Nov 11 '13 at 09:54
  • I think you should include that into your answer, since that constitutes the crucial part of `O(n log n)` preprocessing step. – justhalf Nov 11 '13 at 10:35
  • Added it, I think I'd go for the hash set. – flup Nov 12 '13 at 08:02
3

Actually, it can be done in constant time and linear space(!)

See https://cs.stackexchange.com/questions/16671/range-majority-queries-most-freqent-element-in-range and S. Durocher, M. He, I Munro, P.K. Nicholson, M. Skala, Range majority in constant time and linear space, Information and Computation 222 (2013) 169–179, Elsevier.

Their preparation time is O(n log n), the space needed is O(n) and queries are O(1). It is a theoretical paper and I don't claim to understand all of it but it seems far from impossible to implement. They're using wavelet trees.

For an implementation of wavelet trees, see https://github.com/fclaude/libcds

flup
  • I think it's essentially the same as the answer by Evgeny Kluev. That is, to limit the number of candidates, then count the frequency of those candidates using some data structure. Thank you for pointing out the paper! =D – justhalf Nov 11 '13 at 01:55
  • Essentially, I think it wipes the floor with the other answers since it needs less space and comes up with an answer way quicker. The only problem is that there is a gap to be bridged between theory and practice. Must implement it. – flup Nov 12 '13 at 08:00
-1

If you have unlimited memory and a limited data range (like short int), you can do it even in O(N) time.

  1. Go through the array and count the number of 1s, 2s, 3s, etc. (the number of entries for each value you have in the array). You will need an additional array X, with one entry for each possible value of your type, for this.
  2. Go through array X and find the maximum.

In total O(1) + O(N) operations.


Also, you can limit yourself to O(N) memory if you use a map instead of array X. But then you will need to look up the element on each iteration of stage 1, so you will need O(N*log(N)) time in total.
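
A minimal sketch of the counting approach described above (finding the most frequent value of the whole array), assuming all values are non-negative and bounded by maxValue:

#include <vector>

// Returns the most frequent value in src, assuming all values are in [0, maxValue].
int most_frequent(const std::vector<int>& src, int maxValue)
{
    std::vector<int> counts(maxValue + 1, 0); // the "array X" from step 1

    for (int x : src)                         // step 1: count occurrences of each value
        ++counts[x];

    int best = 0;
    for (int v = 1; v <= maxValue; ++v)       // step 2: find the maximum
        if (counts[v] > counts[best])
            best = v;

    return best;
}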

klm123
  • Although this is _not at all_ what the OP asked for, it's what I would suggest too, since it's a sensible approach. OP wants a smaller subrange from _at most_ 1 million elements (from the examples, more likely something like a range of 5-10 elements), so we're talking of at most 4MiB of memory (= negligible), or rather some dozen bytes in the real case (= even more negligible). Going linearly over 1 million elements once is also ridiculously fast, regardless of big-O, and finding the biggest of a dozen numbers (or even a million) is also ridiculously fast, really no need to think of big-O. – Damon Nov 07 '13 at 14:08
  • A more "clever" algorithm that only needs to magically access log(n) elements (how would that work? Either preparation or calculation must necessarily look at every element at least once if the numbers aren't sorted... so you must have a O(N) somewhere) likely won't be any faster in reality, due to cache effects. – Damon Nov 07 '13 at 14:10
  • @Damon, Well, it is hard for me to understand what exactly the OP wants. Why is it not at all what the OP asks if he talks about small subranges? – klm123 Nov 07 '13 at 14:50
  • @Damon, I can't distinguish between preparations and the job itself here too. I have asked the OP about this. – klm123 Nov 07 '13 at 14:51
  • Well, the OP wants O(log n) with at most O(n log n) preparation time, and yours is O(n) and O(m), respectively. Though I believe that what the OP wants isn't possible at all. If the elements are not already sorted or in a tree structure or such, how would that work without looking at every element _at least once_? There's room to improve on your O(N) approach, however. One could adapt Boyer and Moore's majority vote algorithm to trivially check whether an element is in the requested range (and skip over it, if it isn't). That would be one single pass (i.e. O(n)) without preparation. – Damon Nov 07 '13 at 15:35
  • I didn't call any part of my algorithm "preparation". I can say that it takes O(N) preparation time and 0 work time. How do you distinguish preparation and work time if you have only one set of data and ask only for one value of information? – klm123 Nov 07 '13 at 15:48
-1

You can use a max heap, with the frequency of each number as the deciding factor for maintaining the max-heap property. For example, for the following input array

1 5 2 7 7 7 8 4 6 5

The heap would have all distinct elements with their frequencies associated with them:
    Element = 1  Frequency = 1,
    Element = 5  Frequency = 2,
    Element = 2  Frequency = 1,
    Element = 7  Frequency = 3,
    Element = 8  Frequency = 1,
    Element = 4  Frequency = 1,
    Element = 6  Frequency = 1

As it's a max heap, element 7 with frequency 3 would be at the root. Just check whether the input range contains this element; if yes, then this is the answer; if not, go to the left or right subtree as per the input range and perform the same checks.

O(N) would be required only once while creating the heap, but once it's created, searching will be efficient.
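
A minimal sketch of just the construction step described here (counting frequencies and pushing (frequency, element) pairs into a max heap); the function name is illustrative, and this alone does not answer range queries:

#include <map>
#include <queue>
#include <utility>
#include <vector>

// Build a max heap of (frequency, element) pairs for the whole array,
// so the most frequent element ends up at the top.
std::priority_queue<std::pair<int, int>>
build_frequency_heap(const std::vector<int>& src)
{
    std::map<int, int> freq;
    for (int x : src)
        ++freq[x];

    std::priority_queue<std::pair<int, int>> heap; // ordered by frequency first
    for (const auto& kv : freq)
        heap.push({kv.second, kv.first});

    return heap;
}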

Shrikant
-3

Edit: Sorry, I was solving a different problem.

Sort the array and build an ordered list of pairs (value, number_of_occurrences) - it's O(N log N). Starting with

1 5 2 7 7 7 8 4 6

it will be

(1,1) (2,1) (4,1) (5,1) (6,1) (7,3) (8,1)

On top of this array, build a binary tree with pairs (best_value_or_none, max_occurrences). It will look like:

(1,1) (2,1) (4,1) (5,1) (6,1) (7,3) (8,1)
   \   /       \   /       \  /       |
   (0,1)       (0,1)       (7,3)    (8,1)
        \     /                 \   /
         (0,1)                  (7,3)
              \                /
                     (7,3)

This structure definitely has a fancy name, but I don't remember it :)

From here, it's O(log N) to fetch the mode of any interval. Any interval can be split into O(log N) precomputed intervals; for example:

[4, 7] = [4, 5] + [6, 7]
f([4,5]) = (0,1)
f([6,7]) = (7,3)

and the result is (7,3).
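
This structure is usually called a segment tree. A minimal sketch over the sorted (value, occurrences) pairs, using value 0 for "none" as in the diagram (names are my own):

#include <utility>
#include <vector>

using Pair = std::pair<int, int>; // (best_value_or_none, max_occurrences); 0 = none

Pair combine(Pair a, Pair b)
{
    if (a.second > b.second) return a;
    if (b.second > a.second) return b;
    return {0, a.second};             // tie: no single best value
}

struct ModeTree
{
    int n;
    std::vector<Pair> t;              // iterative segment tree, leaves at t[n..2n-1]

    explicit ModeTree(const std::vector<Pair>& leaves)
        : n(leaves.size()), t(2 * leaves.size())
    {
        for (int i = 0; i < n; ++i)
            t[n + i] = leaves[i];
        for (int i = n - 1; i > 0; --i)
            t[i] = combine(t[2 * i], t[2 * i + 1]);
    }

    // Best pair over leaves [l, r): combines O(log N) precomputed intervals.
    Pair query(int l, int r) const
    {
        Pair res(0, 0);
        for (l += n, r += n; l < r; l >>= 1, r >>= 1)
        {
            if (l & 1) res = combine(res, t[l++]);
            if (r & 1) res = combine(res, t[--r]);
        }
        return res;
    }
};

For the seven pairs above, query(2, 6) (the pairs with values 4..7) returns (7,3), matching the example.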

Maxim Razin
  • How does knowing the mode of two halves of a range tell you anything about the mode of the whole range? – Rob Nov 03 '13 at 16:42
  • Well, first of all, when I sort an array, my queried intervals are practically gone. Secondly, are you sure it will return right results every time? – Krzysztofik Nov 03 '13 at 16:45
  • It is not about intervals of indexes of the original sequence, it's interval of values. – Maxim Razin Nov 03 '13 at 17:28
  • @grep: Are you sure about that? I don't see that in the question. – Beta Nov 07 '13 at 06:48