Optimal algorithm to find-and-remove repeating patterns in collections

Question

I recently came across a real-world application of the algorithm described below and found it interesting, so I thought I'd share it in the hope of getting a better solution. It's somewhat similar to the well-known "longest repeated substring" problem, and since that has a nice O(n) solution, this one might as well. I do have a possible solution, but it may be wrong (it might not handle all cases correctly).

Suppose you have a list/array of elements that can be compared for equality; I'll use characters for simplicity. Any consecutive repeating pattern within the string can be shortened by replacing the whole run of consecutive repetitions with a single instance of the pattern.

For example, the input string ABABABC can be shortened to ABC. I mark the result as [AB:3]C for display, but the resulting string itself is ABC.

For any shortened string, the reduction is the number of characters by which the string becomes shorter. For example, shortening the input string ABCABCDABAB to ABCDAB (marked as [ABC:2]D[AB:2]) gives a reduction of 5, because the original string is 11 characters long and the shortened string is 6 characters long.
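As a small illustration of the arithmetic above, the reduction can be computed directly from the marked form: each marked pattern of length `len` repeated `count` times contributes `(count - 1) * len`. The class and method names here are hypothetical and only serve the example.

```java
final class ReductionExample {

    /**
     * Sums the reduction of a marked shortening.
     * Each row is {patternLength, count}; a non-repeating part has count 1.
     */
    static int reduction(int[][] segments) {
        int total = 0;
        for (int[] segment : segments) {
            int patternLength = segment[0];
            int count = segment[1];
            total += (count - 1) * patternLength; // copies removed beyond the first
        }
        return total;
    }

    public static void main(String[] args) {
        // [ABC:2]D[AB:2] for the input ABCABCDABAB
        int[][] segments = { {3, 2}, {1, 1}, {2, 2} };
        System.out.println(reduction(segments)); // prints 5
    }
}
```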

Now, the task is to write an algorithm that, for any input string, finds the shortening that maximizes the reduction.

My solution in Java:

import java.util.*;

public class Program {

public static void main(String[] args) {
    String data = "ABCABCDABAB";
    RepeatList optimal = getOptimal(data.toCharArray(), 0);

    String result = "";
    for(Repeat repeat : optimal.repeats) {
        String ref = new String(repeat.reference);
        result += (repeat.count > 1 ? ("[" + ref + ":" + repeat.count + "]") : ref);
    }

    System.out.println(data + " => " + result);
}

static class Repeat {
    final int index, length, count;
    char[] reference;

    Repeat(int index, int length, int count, char[] data) {
        this.index = index;
        this.length = length;
        this.count = count;
        // Keep a single instance of the pattern so the output reads [ABC:2], not [ABCABC:2].
        this.reference = Arrays.copyOfRange(data, index, index + length);
    }
}

static class RepeatList {
    List<Repeat> repeats = new ArrayList<>();
    int reduction = 0;

    void prepend(Repeat repeat) {
        repeats.add(0, repeat);
        reduction += (repeat.count - 1) * repeat.length;
    }
}

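// Returns the best shortening found for data[from..end).
// Every repeat run starting at or after 'from' is tried; the rest of the
// input after the run is solved recursively.
static RepeatList getOptimal(char[] data, int from) {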
    RepeatList best = new RepeatList();
    best.prepend(new Repeat(from, data.length - from, 1, data));

    for(int index = from; index < data.length; index++) {
        for(int length = 1; length <= (data.length - index) / 2; length++) {
            int count = 1;
            boolean isRepeat = true;
            for(; index + (count + 1) * length <= data.length; count++) {
                for(int elementIndex = index; elementIndex - index < length && isRepeat; elementIndex++) {
                    isRepeat = data[elementIndex] == data[elementIndex + count * length];
                }

                if(!isRepeat) {
                    break;
                }
            }

            if(count > 1) {
                RepeatList sublist = getOptimal(data, index + count * length);
                sublist.prepend(new Repeat(index, length, count, data));
                if(index > from) {
                    sublist.prepend(new Repeat(from, index - from, 1, data));
                }

                if(best.reduction < sublist.reduction) {
                    best = sublist;
                }
            }
        }
    }

    return best;
}
}

For the sake of generality, the approach above treats a non-repeating substring as a pattern that occurs once. In the end I need a sort of specification/blueprint of the shortening so that I can apply it to the input data. Also, I'm well aware that some of the coding decisions above are bad, but most of them were made to keep the code short for posting here (for example, string concatenation in a loop).
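To make the idea of a blueprint concrete, here is a minimal sketch of applying one to the input. It assumes the blueprint is an ordered list of (index, length, count) triples covering the whole input, much like the `Repeat` objects above; the `Segment` and `BlueprintDemo` names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class BlueprintDemo {

    /** One blueprint entry: the pattern data[index..index+length) occurs 'count' times in a row. */
    static final class Segment {
        final int index, length, count;

        Segment(int index, int length, int count) {
            this.index = index;
            this.length = length;
            this.count = count;
        }
    }

    /** Builds the shortened string by keeping one instance of each segment's pattern. */
    static String apply(String data, List<Segment> blueprint) {
        StringBuilder out = new StringBuilder();
        for (Segment segment : blueprint) {
            out.append(data, segment.index, segment.index + segment.length);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // [ABC:2]D[AB:2] expressed as a blueprint for ABCABCDABAB
        List<Segment> blueprint = Arrays.asList(
                new Segment(0, 3, 2),
                new Segment(6, 1, 1),
                new Segment(7, 2, 2));
        System.out.println(apply("ABCABCDABAB", blueprint)); // prints ABCDAB
    }
}
```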

My question is this: is this a valid solution, and if yes, is it the simplest possible or is there something better?

I see one possible issue: a run of repeating patterns is always taken either in full or not at all. It seems intuitive that there is an input for which this causes a problem, but I haven't been able to come up with one. I was thinking along the lines of ABABABCDEFBCDEF, where it's better to leave out the third AB because doing so makes it possible to remove one repetition of BCDEF: [AB:2]A[BCDEF:2]. Since the approach is "all or none", I thought this could be a problem. However, the algorithm gave an equivalent solution: A[BA:2][BCDEF:2]. I understand how this kind of solution can be generalized (it's just a matter of where the pattern starts) and I think this could prove that the "all or none" approach is OK, but I'm not sure...
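One way to probe the "all or none" concern experimentally is sketched below. It relies on the observation that the result of `getOptimal(data, from)` depends only on `from`, so the best reduction of each suffix can be tabulated once, from the end of the input backwards. This is only a sketch under that assumption, not a proof and not a claim of optimality: the `SuffixDp` class and the `tryPartialRuns` flag are hypothetical, and the method returns just the reduction value, not the blueprint. With the flag off it follows the all-or-none rule; with it on, every count from 2 up to the full run is also tried, so the two results can be compared on any input.

```java
final class SuffixDp {

    /** best[i] holds the largest reduction found for the suffix starting at i. */
    static int bestReduction(String input, boolean tryPartialRuns) {
        char[] data = input.toCharArray();
        int n = data.length;
        int[] best = new int[n + 1]; // best[n] == 0 for the empty suffix
        for (int i = n - 1; i >= 0; i--) {
            best[i] = best[i + 1]; // option: leave data[i] unshortened
            for (int length = 1; length <= (n - i) / 2; length++) {
                int count = runLength(data, i, length);
                int firstCount = tryPartialRuns ? 2 : Math.max(2, count);
                for (int k = firstCount; k <= count; k++) {
                    int candidate = (k - 1) * length + best[i + k * length];
                    best[i] = Math.max(best[i], candidate);
                }
            }
        }
        return best[0];
    }

    /** Number of consecutive copies of data[i..i+length) starting at i (at least 1). */
    static int runLength(char[] data, int i, int length) {
        int count = 1;
        while (i + (count + 1) * length <= data.length
                && sameBlock(data, i, i + count * length, length)) {
            count++;
        }
        return count;
    }

    /** True if the two blocks of the given length starting at a and b are equal. */
    static boolean sameBlock(char[] data, int a, int b, int length) {
        for (int k = 0; k < length; k++) {
            if (data[a + k] != data[b + k]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // ABABABCDEFBCDEF is the expansion of [AB:2]A[BCDEF:2]
        System.out.println(bestReduction("ABABABCDEFBCDEF", false)); // prints 7
        System.out.println(bestReduction("ABABABCDEFBCDEF", true));  // prints 7
    }
}
```

On this particular input both variants agree, which matches the equivalent solution reported above; running both over many random inputs would be one way to look for a counterexample. The table has one entry per suffix, and each entry scans every pattern length and its run, so the work is at most cubic in the input length.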

I've seen questions like this on SO before, but I can't seem to find one with good answers. See e.g. https://stackoverflow.com/questions/46876515/algorithm-for-simple-string-compression/46879601 — m69's been on strike for years, Aug 21 '18 at 01:24
