Find mode of a multiset in given time bound (most multiplicity)

Question

The given problem:

A multiset is a set in which some of the elements occur more than once (e.g. {a, f, b, b, e, c, b, g, a, i, b} is a multiset). The elements are drawn from a totally ordered set. Present an algorithm that, when presented with a multiset as input, finds an element that has the most occurrences in the multiset (e.g. in {a, f, b, b, e, c, b, g, a, c, b}, b has the most occurrences). The algorithm should run in O(n lg n/M + n) time, where n is the number of elements in the multiset and M is the highest number of occurrences of an element in the multiset. Note that you do not know the value of M.

[Hint: Use a divide-and-conquer strategy based on the median of the list. The subproblems generated by the divide-and-conquer strategy cannot be smaller than a ‘certain’ size in order to achieve the given time bound.]

Our initial solution:

Our idea was to use Moore's majority algorithm to determine whether the multiset contains a majority element (e.g. {a, b, b} has a majority element, b). If it does, we output that element. Otherwise we find the median of the list using a given algorithm (known as Select) and split the list into three sublists (elements less than, equal to, and greater than the median). We would then check each of the sublists for a majority element; if one is found, that is the result.

For example, given the multiset {a, b, c, d, d, e, f}

Step 1: check for majority. None found, split the list based on the median.

Step 2: L1 = {a, b, c, d, d}, L2 = {e, f} Find the majority of each. None found, split the lists again.

Step 3: L11 = {a, b, c} L12 = {d, d} L21 = {e} L22 = {f} Check each for majority elements. L12 returns d. In this case, d is the most frequently occurring element in the original multiset, and thus is the answer.
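
A minimal sketch of the majority check used in each step above, assuming that "Moore's majority algorithm" refers to the standard Boyer-Moore majority vote. The function name majority_element is hypothetical, and the input is assumed to be a list because it is scanned twice.

```python
def majority_element(items):
    """Return the element occupying more than half of items, or None.

    Boyer-Moore majority vote: one pass to pick a candidate,
    a second pass to verify it. O(n) time, O(1) extra space.
    """
    candidate, votes = None, 0
    for x in items:
        if votes == 0:
            # No surviving candidate: adopt the current element.
            candidate, votes = x, 1
        elif x == candidate:
            votes += 1
        else:
            # A different element cancels one vote.
            votes -= 1

    # The vote only guarantees the candidate IF a majority exists,
    # so count its occurrences to confirm.
    count = sum(1 for x in items if x == candidate)
    return candidate if 2 * count > len(items) else None
```

On the example above, majority_element(['a', 'b', 'c', 'd', 'd', 'e', 'f']) finds no majority, while majority_element(['d', 'd']) returns 'd'.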

The issues we're having are whether this type of algorithm is fast enough, as well as whether this can be done recursively or if a loop that terminates is required. Like the hint says, the sub-problems cannot be smaller than a 'certain' size, which we believe to be M (the most occurrences).

score 1 · Answer 1 · answered Mar 04 '15 at 19:39

If you want to do this in real life, it is worth considering using a hash table to track the counts. This can have amortized O(1) complexity per hash table access, so the overall complexity of the following Python code is O(n).

import collections
C = collections.Counter(['a','f','b','b','e','c','b','g','a','i','b'])
most_common_element, highest_count = C.most_common(1)[0]

score 1 · Accepted Answer · answered Mar 04 '15 at 19:42

If you use recursion in the most straightforward way, as described in your post, it will not have the desired time complexity. Why? Let's assume that the answer element is the largest one. Then it is always located in the right branch of the recursion. But we call the left branch first, which can go much deeper if all elements there are distinct (producing pieces of size 1, while we do not want pieces smaller than M).

Here is a correct solution:

Let's always split the array into three parts at each step, as described in your question. Now let's step aside and take a look at what we have: the recursive calls form a tree. To get the desired time complexity, we should never go deeper than the level where the answer is located. To achieve this, we can traverse the tree using a breadth-first search with a queue instead of a depth-first search, and stop once every remaining piece is no larger than the best count found so far, since such a piece cannot contain a more frequent element. That's it.
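
Here is a rough sketch of that breadth-first traversal, under a few assumptions that are not in the original posts. The names select and mode are hypothetical. Randomized quickselect (expected linear time) stands in for the deterministic Select mentioned in the question. Instead of running a separate majority vote, the sketch counts the copies of the median directly; that covers the majority case, because a majority element of a piece is always its median.

```python
import random
from collections import deque


def select(items, k):
    """Return the k-th smallest element (0-based) via randomized quickselect.

    Assumption: stands in for the linear-time Select from the question;
    this version is linear only in expectation.
    """
    items = list(items)
    while True:
        pivot = random.choice(items)
        lower = [x for x in items if x < pivot]
        equal = sum(1 for x in items if x == pivot)
        if k < len(lower):
            items = lower
        elif k < len(lower) + equal:
            return pivot
        else:
            k -= len(lower) + equal
            items = [x for x in items if x > pivot]


def mode(multiset):
    """Return (element, count) for an element with the most occurrences."""
    items = list(multiset)
    if not items:
        raise ValueError("mode() of an empty multiset")

    best_element, best_count = None, 0
    queue = deque([items])  # FIFO queue, so pieces are visited level by level
    while queue:
        piece = queue.popleft()
        if len(piece) <= best_count:
            # Too small to hold anything more frequent than the current best.
            continue
        median = select(piece, len(piece) // 2)
        lower = [x for x in piece if x < median]
        upper = [x for x in piece if x > median]
        count = len(piece) - len(lower) - len(upper)  # copies of the median
        if count > best_count:
            best_element, best_count = median, count
        queue.append(lower)
        queue.append(upper)
    return best_element, best_count
```

For the multiset {a b c d d e e e} from the comments, mode(list("abcddeee")) first records d with count 2, then replaces it with e (count 3) when the upper piece is processed. Each level of the tree costs expected O(n) in total and piece sizes at least halve per level, so the pruning test should cut the traversal off after about lg(n/M) levels.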

answered Mar 04 '15 at 19:42

kraskevich

Can you expand on your solution? Because we don't know the value of M (the most occurrences), how would we continue down the tree until the answer is found? For example, use input { a b c d d e f }. So for this set, are we still considering Moore's majority algorithm first? If there's no majority element in the set, split the list based on the median into three lists: lower, median and upper? – NoGoodAlgorithm Mar 04 '15 at 20:26
@MichaelSalvador Yes. But if the left part does not contain a majority element, instead of splitting it we process the middle and the upper part first. Only after that do we go to the next layer. – kraskevich Mar 04 '15 at 20:36
I think I follow, but I'm curious what would happen in the scenario where we have a list like {a b c d d e e e}. The median would be computed as 'd', which would cause three lists {a b c}, {d d} and {e e e} to be generated on the next level of our tree. Now the majority algorithm will return d for the median list, although d is not our correct answer. e will also be returned once that part of the list is examined. The tree will not continue because of these two elements. Is there another comparison to be done at this point? Or am I going about this incorrectly? – NoGoodAlgorithm Mar 04 '15 at 20:44
@MichaelSalvador We need to keep exploring other parts until all that remains is definitely smaller. So the last one should be checked, then it is over in this case. – kraskevich Mar 04 '15 at 20:54
Right but in the case where 'd' and 'e' are both returned, do we have to consider the length of each list? How do we know the answer is e and not d? The majority algorithm will return both, because in that case, both d and e are the majority elements of each sub-list. – NoGoodAlgorithm Mar 04 '15 at 21:03
@MichaelSalvador In this case we need to check which of them has more occurrences. – kraskevich Mar 04 '15 at 21:04
