How can I check for a reoccurring pattern in a string without knowing the actual pattern?

Question

For example, I have a string, "fbrtfuifigfbrt". I want to find whether a sequence of characters reoccurs in the string, but I don't know what that sequence of characters is. In this case, it is "fbrt".

I thought about breaking the string into a bunch of individual words and then checking if the words are the same, but that quickly becomes inefficient when parsing a longer string.

For now, I implemented the above idea, but surely there's a better idea.

String s = "fbrtfuifigfbrt";
ArrayList<String> words = new ArrayList<String>(s.length() * s.length());

//collect every non-empty substring s[outerLoop, nestedLoop)
for(int outerLoop = 0; outerLoop < s.length(); outerLoop++){
    //start past outerLoop so that substring() never gets end < start
    for(int nestedLoop = outerLoop + 1; nestedLoop <= s.length(); nestedLoop++){
        words.add(s.substring(outerLoop, nestedLoop));
    }
}
//I could dump the ArrayList in a HashSet and check if they are the same size, 
//then find those elements, etc. 
//but that goes along with the above code, and I would prefer to use a more efficient method
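The HashSet idea mentioned in the comments above could look roughly like the following. This is only a sketch: the class and method names (`RepeatedSubstringsBySet`, `findRepeated`) are made up for illustration, and it assumes that overlapping occurrences count as repetitions (so "aa" counts as repeated in "aaa").

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class RepeatedSubstringsBySet {
    // Returns every substring that occurs at least twice in s.
    // Assumption: overlapping occurrences count as repetitions.
    static Set<String> findRepeated(String s) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new LinkedHashSet<>();
        for (int start = 0; start < s.length(); start++) {
            for (int end = start + 1; end <= s.length(); end++) {
                String sub = s.substring(start, end);
                // Set.add returns false when the element was already present
                if (!seen.add(sub)) {
                    repeated.add(sub);
                }
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        System.out.println(findRepeated("fbrtfuifigfbrt"));
    }
}
```

This still enumerates all O(n^2) substrings, so it is not more efficient than the nested loops above; it only replaces the size comparison with a direct duplicate check.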

Need more info on what qualifies. Would the string "abcdabcdab" match a pattern? How about "xxabcabcabcx"? — nhouser9, Nov 11 '16 at 00:33
@nhouser9 Yes, substrings of both would qualify (*abcd* for the first one, *abc* for the second one). The patterns would have to occur more than once to be a pattern. — ack, Nov 11 '16 at 00:37
I think you need to know the length of the string you want to find — Oscar Martinez, Nov 11 '16 at 00:38
@OscarMartinez I can't know the length of the substring, as it is variable and could be anything. — ack, Nov 11 '16 at 00:40
So for the string "aabbabcabaaaaaa" all of the following substrings would match: "a", "b", "ab", "aa", "aaa". Is that correct? — nhouser9, Nov 11 '16 at 00:42
But imagine "abcdabcd", "a" "ab" "abc" "abcd" "cd" "bcd"... so on. All of these substrings would be candidates — Oscar Martinez, Nov 11 '16 at 00:42
@OscarMartinez Yes, all of those would be potential results. — ack, Nov 11 '16 at 00:44
If you really want to find all those results the way you were already going about it is basically fine. There is no real optimization for this. — nhouser9, Nov 11 '16 at 00:52
Despite having tried without result I believe that doing it **recursively** is the answer — Oscar Martinez, Nov 11 '16 at 01:07
@GusChambers I posted a new answer below. I tested it and it works just fine. — nhouser9, Nov 11 '16 at 01:08
@GusChambers Can you define the minimum length of a pattern? For example single character is not a pattern. So your min length would be 2. — Ozair Kafray, Nov 11 '16 at 07:36
@OzairKafray There is no minimum length of a pattern, for example, "a", as long as it occurred more than once, would be a pattern. — ack, Nov 11 '16 at 12:35
Hi @GusChambers, looking at your comments I can't but guess that you got to solve your problem thanks to the answers and tips given by us. If that's the case, please mark the answer you found most useful as accepted, so the user gets a reward for the time they took in helping you, and your question stops appearing as unanswered. Cheers! — walen, Nov 14 '16 at 10:48

score 1 · Answer 1 · answered Nov 11 '16 at 01:08

There isn't a good optimization for this. You are going to end up with some kind of a brute force solution.

Something like:

String myString = "abcabcbbb";
//for each char
for (int i = 0; i < myString.length(); i++) {
    //for each substring starting with that char
    int maxSubStringLen = Math.floorDiv(myString.length() - i, 2);
    for (int j = 1; j <= maxSubStringLen; j++) {
        //get the substring
        String subString = myString.substring(i, i + j);
        int repetitionIndex = i + j;
        String repetition = myString.substring(repetitionIndex, repetitionIndex + subString.length());

        //does the substring repeat?
        if (subString.equals(repetition)) {
            System.out.println(subString);
        }
    }
}

This simply prints all substrings that match. You can replace the print statement with whatever you actually want to do with them.

Hmmmm I think you misunderstood the question. You're printing substrings that repeat only immediately after themselves, like x_abcabc_x. Repetitions further away like x_abc_efg_abc_x are not printed. It would work if you replaced the second `substring` with an `indexOf`, however. — walen, Nov 11 '16 at 08:10
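The `indexOf` change suggested in the comment above might be sketched as follows. The class and method names (`RepeatsWithIndexOf`, `findRepeats`) are hypothetical, and collecting results into a `LinkedHashSet` instead of printing them is an assumption made so that duplicates are dropped.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class RepeatsWithIndexOf {
    // Collects substrings that occur again anywhere after themselves,
    // not only immediately after.
    static Set<String> findRepeats(String text) {
        Set<String> found = new LinkedHashSet<>();
        for (int i = 0; i < text.length(); i++) {
            // a candidate and its non-overlapping repetition must both fit in the rest
            int maxLen = (text.length() - i) / 2;
            for (int j = 1; j <= maxLen; j++) {
                String candidate = text.substring(i, i + j);
                // indexOf(str, fromIndex) searches from the end of the candidate onwards
                if (text.indexOf(candidate, i + j) >= 0) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findRepeats("xabcefgabcx"));
    }
}
```

Compared with the answer's `substring` comparison, the only behavioural change is where the repetition is allowed to start.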

walen · Accepted Answer · 2016-11-11T13:15:27.290

Working solution in Java:

import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        String test1 = "fbrtfuifigfbrt";
        String test2 = "abcdabcd";
        String test3 = "fbrtxibrjkfbrt";
        System.out.println(findRepetitions(test1));
        System.out.println(findRepetitions(test2));
        System.out.println(findRepetitions(test3));
    }

    private static List<String> findRepetitions(String string) {
        List<String> patternsList = new ArrayList<>();
        int length = string.length();
        for (int i = 0; i < length; i++) { // try every start index; limit drops to 0 near the end
            int limit = (length - i) / 2; // candidates can't be longer than half the remaining length
            for (int j = 1; j <= limit; j++) {
                int candidateEndIndex = i + j;
                String candidate = string.substring(i, candidateEndIndex);
                if (string.substring(candidateEndIndex).contains(candidate)) {
                    patternsList.add(candidate);
                }
            }
        }
        return patternsList;
    }
}

Output:

[f, fb, fbr, fbrt, b, br, brt, r, rt, t, f, i, f]
[a, ab, abc, abcd, b, bc, bcd, c, cd, d]
[f, fb, fbr, fbrt, b, br, brt, r, rt, t, b, br, r]

As others already said, there's no easy optimization for this if you don't know the length of the pattern or any other applicable restriction.

If you wanted to naively discard subpatterns like f, fb, fbr which are being counted just because they are substrings of the longest fbrt pattern, you could make the inner for count downwards, from limit down to 1, so you would find longer patterns first, and then check if the next patterns are a substring of already found ones before adding them to the list. Like this:

import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        String test1 = "fbrtfuifigfbrt";
        String test2 = "abcdabcd";
        String test3 = "fbrtxibrjkfbrt"; // "br" is a pattern but this version won't find it
        System.out.println(findRepetitions(test1));
        System.out.println(findRepetitions(test2));
        System.out.println(findRepetitions(test3));
    }

    private static List<String> findRepetitions(String string) {
        List<String> patternsList = new ArrayList<>();
        int length = string.length();
        for (int i = 0; i < length; i++) { // try every start index; limit drops to 0 near the end
            int limit = (length - i) / 2; // candidates can't be longer than half the remaining length
            for (int j = limit; j >= 1; j--) {
                int candidateEndIndex = i + j;
                String candidate = string.substring(i, candidateEndIndex);
                if (string.substring(candidateEndIndex).contains(candidate)) {
                    boolean notASubpattern = true;
                    for (String pattern : patternsList) {
                        if (pattern.contains(candidate)) {
                            notASubpattern = false;
                            break;
                        }
                    }
                    if (notASubpattern) {
                        patternsList.add(candidate);
                    }
                }
            }
        }
        return patternsList;
    }
}

This, however, would prevent you from finding br in fbrtxibrjkfbrt, as shown by the output (and it'd make the algorithm slower for strings with a lot of different patterns, too):

[fbrt, i]
[abcd]
[fbrt]

Hence the "naively" part. Of course, you could include more inner loops to make sure to-be-discarded candidates aren't found "on their own" in the original string, before actually discarding them... etc. It depends on how exhaustive you want your search to be.
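One possible reading of that "on their own" check is sketched below. It is an illustration under assumptions, not part of the tested solution above: the helper names `occursOnItsOwn` and `isCovered` are invented here, and a candidate is kept when at least one of its occurrences lies outside every occurrence of the patterns accepted so far.

```java
import java.util.ArrayList;
import java.util.List;

class StandaloneFilter {
    // True if some occurrence of candidate is not inside any occurrence of an accepted pattern.
    static boolean occursOnItsOwn(String text, String candidate, List<String> patterns) {
        for (int p = text.indexOf(candidate); p >= 0; p = text.indexOf(candidate, p + 1)) {
            if (!isCovered(text, p, p + candidate.length(), patterns)) {
                return true;
            }
        }
        return false;
    }

    // True if the range [start, end) lies within an occurrence of one of the patterns.
    // A pattern equal to the candidate covers all of its occurrences, which avoids duplicates.
    static boolean isCovered(String text, int start, int end, List<String> patterns) {
        for (String pattern : patterns) {
            for (int q = text.indexOf(pattern); q >= 0; q = text.indexOf(pattern, q + 1)) {
                if (q <= start && end <= q + pattern.length()) {
                    return true;
                }
            }
        }
        return false;
    }

    static List<String> findRepetitions(String text) {
        List<String> patterns = new ArrayList<>();
        int length = text.length();
        for (int i = 0; i < length; i++) {
            int limit = (length - i) / 2;
            for (int j = limit; j >= 1; j--) { // longest candidates first
                String candidate = text.substring(i, i + j);
                if (text.substring(i + j).contains(candidate)
                        && occursOnItsOwn(text, candidate, patterns)) {
                    patterns.add(candidate);
                }
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        System.out.println(findRepetitions("fbrtxibrjkfbrt")); // [fbrt, br]
    }
}
```

With this check, br is kept for fbrtxibrjkfbrt because its middle occurrence is not part of any fbrt. The extra scans make it slower still, so it only makes sense when that distinction matters.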

Xiangyu Li · Answer 3 · 2016-11-11T07:15:04.243

You need two iterators: the first is a global iterator over the entire string, and the second serves as the search pointer. Let's suppose the global iterator points to the char "f" in your example. We need to find all the positions of "f" after the global iterator. For each "f" found after the global iterator, we compare the characters following both the global iterator and the local iterator one by one (think of this as two pointers moving at the same speed until they point to different chars). Once the local iterator reaches the end of the string, you can move the global iterator forward by one character (yes, you need to do this n times if your string has n characters).

I'm sorry that the code is in C++ but the logic is the same in Java.

Update: There is another way to perform the task. One popular solution is to store your text in a suffix tree. You can then search the tree for any given substring to find its occurrences in the whole text. Building the tree is O(n), and the cost of a search depends on the size of your alphabet, which is 26 if you are using only English letters. So if you want to find all reoccurring patterns, you only need to perform the search for each substring of the given text, which will be only O(n^2). So this approach has an overall advantage over the algorithm I propose. But if you don't need performance, my algorithm will definitely suit your needs, since it is simple and easy to implement.
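A full suffix tree is too long to show here, but a related suffix-based idea can be sketched in Java. Note that this swaps in a naively sorted list of suffixes rather than a suffix tree, so it does not have the O(n) build time mentioned above. It relies on the fact that every repeated substring is a common prefix of two neighbouring suffixes in sorted order. The class and method names are hypothetical, and overlapping occurrences are assumed to count.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class RepeatsBySortedSuffixes {
    // Length of the longest common prefix of a and b.
    static int commonPrefixLength(String a, String b) {
        int max = Math.min(a.length(), b.length());
        int len = 0;
        while (len < max && a.charAt(len) == b.charAt(len)) {
            len++;
        }
        return len;
    }

    // Returns every substring that occurs at least twice (overlaps allowed).
    static Set<String> findRepeated(String text) {
        int n = text.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++) {
            suffixes[i] = text.substring(i);
        }
        Arrays.sort(suffixes); // naive sort, not a linear-time construction
        Set<String> repeated = new TreeSet<>();
        for (int i = 1; i < n; i++) {
            int lcp = commonPrefixLength(suffixes[i - 1], suffixes[i]);
            // every prefix of the shared part starts at two different positions
            for (int len = 1; len <= lcp; len++) {
                repeated.add(suffixes[i].substring(0, len));
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        System.out.println(findRepeated("abcdabcd"));
    }
}
```

The result is a sorted set without duplicates, which may or may not be what you want if you also need the positions of each occurrence.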

#include <iostream>
#include <string>
#include <vector>
using namespace std;
int main(int argc, const char * argv[]) {
    string s = "sdfssdddfssss";
    int pairCount = 0;
    vector<string> rep;
    for (int i = 0; i < s.length(); i++)
    {
        vector<int> idx;
        //find all index of all same char as s[i] after i
        //Note: You can optimize this by creating a map of index of 26 letters.
        for (int j = i+1; j < s.length(); j++)
            if (s[i] == s[j]) idx.push_back(j);
        int offset = 0;
        for (int j = 0; j < idx.size(); j++)
        {
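            //stop at the end of the string instead of relying on the terminating '\0'
            while (idx[j]+offset < static_cast<int>(s.length()) && s[i+offset] == s[idx[j]+offset])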
            {
                cout << "Pair found! " << s.substr(i, offset+1) << " " << i << " " << idx[j] << " " << offset + 1 << endl;
                pairCount++;
                offset++;
            }
            offset = 0;
        }
    }
    cout << "Pair count: " << pairCount;
    return 0;
}

If you use the code above, you have to be aware that this is an exhaustive search, which means it will give you all possible pairs and in the worst scenario the complexity of this algorithm is O(N^3) given that I have 3 nested for loops. — Xiangyu Li, Nov 11 '16 at 02:05
The question is in Java and tagged as Java. While a C++ answer might serve as an example for programmers that understand both languages, it certainly is useless for future readers that arrive to this question looking for a Java answer. — walen, Nov 11 '16 at 13:21
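For readers looking for Java, the two-pointer search from this answer could be ported roughly as below. This is an unofficial sketch rather than the answerer's code: the class and method names (`PairSearch`, `findPairs`) are invented, the index list is folded into the inner loop, and the pairs are returned as strings of the form "substring firstIndex secondIndex" instead of being printed.

```java
import java.util.ArrayList;
import java.util.List;

class PairSearch {
    // For every pair of positions i < j holding the same char, walks both
    // pointers forward together and records each matching prefix.
    static List<String> findPairs(String s) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {           // global iterator
            for (int j = i + 1; j < s.length(); j++) {   // search pointer
                if (s.charAt(i) != s.charAt(j)) {
                    continue;
                }
                int offset = 0;
                // explicit bounds check: charAt throws past the end of the string
                while (j + offset < s.length()
                        && s.charAt(i + offset) == s.charAt(j + offset)) {
                    pairs.add(s.substring(i, i + offset + 1) + " " + i + " " + j);
                    offset++;
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> pairs = findPairs("sdfssdddfssss");
        pairs.forEach(System.out::println);
        System.out.println("Pair count: " + pairs.size());
    }
}
```

Like the C++ version, this is an exhaustive search and reports overlapping matches too.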
