Maximum repeating substring of size n

Question

Find the substring of length n that repeats a maximum number of times in a given string.

Input: abbbabbbb# 2
Output: bb

My solution:

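import java.util.Arrays;

public static String mrs(String s, int m) {
    int n = s.length();
    if (m <= 0 || m > n) {
        return "";  // no substring of length m exists
    }
    // Collect every substring of length m, then sort so equal ones become adjacent.
    String[] suffixes = new String[n-m+1];
    for (int i = 0; i < n-m+1; i++) {
        suffixes[i] = s.substring(i, i+m);
    }
    Arrays.sort(suffixes);
    // Scan the runs of equal neighbours and remember the longest run seen so far.
    String ans = suffixes[0];
    int cnt = 1, max = 1;
    for (int i = 1; i < n-m+1; i++) {
        if (suffixes[i].equals(suffixes[i-1])) {
            cnt++;
            if (cnt > max) {
                max = cnt;
                ans = suffixes[i];
            }
        } else {
            cnt = 1;  // a new run starts with one occurrence
        }
    }
    return ans;
}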

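Can it be done better than the above solution? Sorting the n-m+1 substrings costs O(nm log n) time in the worst case (each comparison is O(m)), and storing them takes O(nm) space.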

Nice problem. Cup of tea while we solve it for you, Sir or Ma'am? Show what you've tried so far to solve the problem on your own. — , Feb 11 '17 at 02:32
Have a look at [this question](http://stackoverflow.com/questions/38372159/longest-maximum-repeating-substring). Doesn't solve the exact same problem you have, but it's easily transformable. Or as an alternative this [wikipedia article](https://en.wikipedia.org/wiki/Longest_repeated_substring_problem) also provides a solution. — , Feb 11 '17 at 14:15
@Paul In the question that you have linked to, we have to find the longest maximum repeating substring. For example, for abcefghabcefghabcabc the characters a, b, c repeat 4 times, therefore the longest substring that repeats a maximum number of times is abc of length 3. It is simple to adapt that solution to my question when m = 1, 2 or 3 in the above example. But what if I have to find the maximum repeating substring of length 4, that is, when m is greater than the length of the longest maximum repeating substring? The answer would then be efgh. — Leo18, Feb 11 '17 at 15:10
@Leo18 pretty simple: the solution accepts substrings of any length. Just add a constraint on the length and you've got your solution. On the other hand the answer introduces the constraint that each character may appear at most once in the maximum-substring. This constraint isn't present in your case. The algorithm traverses a suffix-tree and takes the path with highest count. Your algo would introduce the requirement for a specific path-length. — , Feb 11 '17 at 15:27
In the second iteration of the for loop in that solution, they have considered only the characters that equal the max frequency. Should I now consider all characters? As a,b,c will only equal the max frequency of 3. e,f,g,h have a frequency of 2. — Leo18, Feb 11 '17 at 16:18
@Paul Also, I am not able to see how to modify that solution to include non-unique characters in the answer. Can you please give a code snippet or a more detailed answer? Thank you! — Leo18, Feb 11 '17 at 16:45
I made a small mistake in the example that I provided. Correction: Input: abcefghabcefghabcabc 4 Output: abce/efgh. Input: abcefghabchefghabcabc 4 Output: efgh. — Leo18, Feb 11 '17 at 16:56
@Leo18 this answer is locked, so: no, I can't. And you can't modify that code to work for your problem. You can only use the same approach, but you'll have to write your own code based on it. — , Feb 11 '17 at 17:02
Okay, I will try to use that approach. But, can you reopen this question as both are not necessarily the same problem and this may have a better approach to solve than suggested in that post. — Leo18, Feb 11 '17 at 17:38
An O(n) solution is to use [rolling hash](https://en.wikipedia.org/wiki/Rolling_hash). — Gassa, Feb 11 '17 at 21:30
@Gassa that's not even `O(n)` for counting a single pattern in another string. You should look up [Rabin-Karp-algorithm](https://en.wikipedia.org/wiki/Rabin-karp) as that's what you actually mean, I guess. Hashing can give false positives, so this is a neat trick to speed up things a bit, but it's nowhere close to improving the time-complexity. — , Feb 14 '17 at 01:19
@Paul Thanks, I know the algorithm. And you are right, it's not O(n) but O(length-of-input), I've messed up the symbols, sorry! — Gassa, Feb 14 '17 at 09:52
@Gassa it's not O(length-of-input) either. Differing hashes can only rule out a match; equal hashes may still be false positives, so you'll have to compare the substrings with equal hash. This is an improvement in performance, but not a solution in linear time. As already pointed out in your answer, you didn't take hash collisions into account. — , Feb 14 '17 at 12:26
@Paul I disagree, it actually is O(L), where L is the length of input. Even formally, if we take my answer's approach of re-solving the problem if an error occurs, with probability say P, the expected time is O(L * (1 + P + P^2 + P^3 + ...)) which is still O(L). Realistically though, just solving once in O(L) and keeping the probability low would be enough for many applications. — Gassa, Feb 14 '17 at 12:54

score 0 · Answer 1 · answered Feb 14 '17 at 10:09

For a string of length L and a given length k (to avoid confusion with n and m, which the question uses interchangeably at times), we can compute polynomial hashes of all substrings of length k in O(L) total (see Wikipedia for some elaboration on this subproblem).

Now, if we map each hash value to the number of times it occurs, we can find the most frequent value in expected O(L) time with a HashMap, or in O(L log L) with a TreeMap.

After that, just take the substring which got the most frequent hash as the answer.
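
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `RollingHashMrs`, the method name `mostFrequent`, and the constants `BASE` and `MOD` are hypothetical choices, not something prescribed by the approach, and a single modular hash like this one can collide.

```java
import java.util.HashMap;
import java.util.Map;

public class RollingHashMrs {
    // Assumed hash parameters (hypothetical choices); any large prime modulus
    // and a base larger than the alphabet would serve the same purpose.
    private static final long MOD = 1_000_000_007L;
    private static final long BASE = 131L;

    /**
     * Returns a substring of length k whose hash value occurs most often
     * (occurrences may overlap), or "" if no substring of length k exists.
     * Hash collisions are NOT checked here.
     */
    public static String mostFrequent(String s, int k) {
        int len = s.length();
        if (k <= 0 || k > len) {
            return "";
        }

        // BASE^(k-1) mod MOD, needed to drop the leading character of a window.
        long highPow = 1;
        for (int i = 1; i < k; i++) {
            highPow = highPow * BASE % MOD;
        }

        // Polynomial hash of the first window s[0..k).
        long hash = 0;
        for (int i = 0; i < k; i++) {
            hash = (hash * BASE + s.charAt(i)) % MOD;
        }

        Map<Long, Integer> counts = new HashMap<>();
        int bestCount = 0;
        int bestStart = 0;
        for (int start = 0; ; start++) {
            // Count this window's hash and track the most frequent one so far.
            int c = counts.merge(hash, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                bestStart = start;
            }
            if (start + k >= len) {
                break; // that was the last window
            }
            // Slide the window: remove s[start], append s[start + k].
            hash = (hash - s.charAt(start) * highPow % MOD + MOD) % MOD;
            hash = (hash * BASE + s.charAt(start + k)) % MOD;
        }
        return s.substring(bestStart, bestStart + k);
    }
}
```

For the question's example, `RollingHashMrs.mostFrequent("abbbabbbb#", 2)` returns `bb`. When several substrings tie, this sketch returns the one whose count reached the maximum first.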

This solution does not take hash collisions into account. The idea is to just reduce the probability of collisions enough for the application (if it's too high, use multiple hashes, for example). If the application demands that we absolutely never give a wrong answer, we can check the answer in O(L) with another algorithm (KMP, for example), and re-run the whole solution with a different hash function as long as the answer turns out to be wrong.
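
As a sketch of the checking step, the true number of occurrences of a candidate answer can be counted with KMP in linear time; a caller would then compare it with the count recorded for the winning hash value and re-run with a different hash function on a mismatch. The class name `OccurrenceCheck` and the method name `countOccurrences` are hypothetical, and the comparison and re-run logic is left out.

```java
public class OccurrenceCheck {
    /**
     * Counts the (possibly overlapping) occurrences of pattern in text
     * with Knuth-Morris-Pratt, in O(|text| + |pattern|) time.
     */
    public static int countOccurrences(String text, String pattern) {
        int m = pattern.length();
        if (m == 0 || m > text.length()) {
            return 0;
        }

        // fail[i] = length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of pattern[0..i].
        int[] fail = new int[m];
        for (int i = 1, j = 0; i < m; i++) {
            while (j > 0 && pattern.charAt(i) != pattern.charAt(j)) {
                j = fail[j - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(j)) {
                j++;
            }
            fail[i] = j;
        }

        int count = 0;
        for (int i = 0, j = 0; i < text.length(); i++) {
            while (j > 0 && text.charAt(i) != pattern.charAt(j)) {
                j = fail[j - 1];
            }
            if (text.charAt(i) == pattern.charAt(j)) {
                j++;
            }
            if (j == m) {
                count++;
                j = fail[j - 1]; // fall back so overlapping matches are counted
            }
        }
        return count;
    }
}
```

If the true count is lower than the hash count, two different substrings shared the winning hash value, so the reported answer cannot be trusted and the run should be repeated.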
