Finding all the shortest unique substrings of the same length?

Question

Given a string sequence that contains only the four letters ['a','g','c','t'], for example: agggcttttaaaatttaatttgggccc.

Find all the shortest unique sub-strings of the sequence, that is, all the unique sub-strings whose length equals the minimum length over all unique sub-strings.

For example, for aaggcgccttt the answer is ['aa', 'ag', 'gg', 'cg', 'cc', 'ct']. Explanation: the shortest unique sub-strings have length 2.

I have tried using suffix arrays coupled with the longest common prefix (LCP) array, but I have not been able to work out a complete solution.

No, n is not a parameter, nor is it given in the question. The length of the shortest unique sub-string can be anything. But for the given example it is 2, as there are no unique sub-strings of length 1. — vinay_raj, Jul 27 '19 at 15:28
OK, what have you tried so far? It is not really clear from your text, perhaps showing some code would help — norok2, Jul 27 '19 at 15:36
As I stated, I have tried using suffix arrays coupled with the longest common prefix to find the count of unique sub-strings. But I don't know how to find the shortest unique sub-strings — vinay_raj, Jul 27 '19 at 15:37
Could you post some code, in any language, of the approach you tried? — norok2, Jul 27 '19 at 15:56
Question: for an input like `aaaaaaaaaaaaaaaaaaaa` what is the expected solution? the sequence itself, or rather `'a' * (n // 2) + 1` with `n` the length of the sequence? — norok2, Jul 27 '19 at 16:23
Question: for an input like `aaa` what is the expected solution? The sequence itself, or rather `aa`? The difference here is whether to check for overlapping or non-overlapping sequences. In the overlapping case the solution must be the sequence itself, because `aa` is found both when starting at position 0 and at position 1, so it has two occurrences. In the non-overlapping case, once the first `aa` has been picked up there is not enough room in the rest of the sequence to find a second occurrence. Note that for overlapping solutions, it must be at most of length 4 or the sequence itself. — norok2, Jul 27 '19 at 16:59
For aaaaaaaaaaaa, there is no solution; simply return blank in these types of cases. — vinay_raj, Jul 27 '19 at 17:06
@vinay_raj Surely in that case the whole string is the shortest unique substring? — m69's been on strike for years, Jul 27 '19 at 19:51
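
The overlapping versus non-overlapping distinction raised in the comments above can be made concrete with a small Python sketch. The two helper names are hypothetical and only serve the illustration; `str.count` counts non-overlapping occurrences, while a zero-width lookahead counts every starting position.

```python
import re


def count_non_overlapping(text, pattern):
    # str.count() skips past each match, so occurrences never overlap
    return text.count(pattern)


def count_overlapping(text, pattern):
    # a lookahead matches at every start position without consuming text
    return len(re.findall('(?=' + re.escape(pattern) + ')', text))


print(count_non_overlapping('aaa', 'aa'))  # 1
print(count_overlapping('aaa', 'aa'))      # 2
```

With overlaps allowed, `aa` is therefore not unique in `aaa`; without overlaps, it is.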

Quetzalcoatl · Answer 1 · 2019-07-27T16:41:59.603

I'm not sure what you mean by "minimum unique sub-string", but looking at your example I assume you mean "shortest runs of a single letter". If this is the case, you just need to iterate through the string once (character by character) and count all the shortest runs you find. You should keep track of the length of the minimum run found so far (infinity at start) and the length of the current run.

If you need to find the exact runs, you can add all the minimum runs you find to e.g. a list as you iterate through the string (and modify that list accordingly if a shorter run is found).

EDIT: I thought more about the problem and came up with the following solution.

We find all the unique sub-strings of length i (in ascending order). So, first we consider all sub-strings of length 1, then all sub-strings of length 2, and so on. If we find any, we stop, since the sub-string length can only increase from this point.

You will have to use a list to keep track of the sub-strings you've seen so far, and a list to store the actual sub-strings. You will also have to maintain them accordingly as you find new sub-strings.

Here's the Java code I came up with, in case you need it:

import java.util.ArrayList;

public class ShortestUniqueSubstrings {
    public static void main(String[] args) {
        String str = "aaggcgccttt";
        String curr = "";
        ArrayList<String> uniqueStrings = new ArrayList<String>();
        ArrayList<String> alreadySeen = new ArrayList<String>();

        for (int i = 1; i < str.length(); i++) { //i is the sub-string length
            for (int j = 0; j < str.length() - i + 1; j++) {
                curr = str.substring(j, j + i);

                if (!alreadySeen.contains(curr)){ //Sub-string hasn't been seen yet
                    uniqueStrings.add(curr);
                    alreadySeen.add(curr);
                }
                else //Repeated sub-string found
                    uniqueStrings.remove(curr);
            }

            if (!uniqueStrings.isEmpty()) //We have found non-repeating sub-string(s)
                break;

            alreadySeen.clear();
        }

        //Output (the whole string is printed if no shorter unique sub-string exists)
        if (uniqueStrings.isEmpty())
            System.out.println(str);
        else {
            for (String s : uniqueStrings)
                System.out.println(s);
        }
    }
}

The uniqueStrings list contains all the unique sub-strings of minimum length (used for output). The alreadySeen list keeps track of all the sub-strings that have already been seen (used to exclude repeating sub-strings).
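
For comparison, the same length-by-length search can be sketched in Python with a hash-based counter instead of list lookups. This is only an illustration of the idea described above, not a translation of the Java code; the function name is hypothetical, and overlapping occurrences are counted as repeats.

```python
from collections import Counter


def shortest_unique_by_length(text):
    # try lengths in ascending order and stop at the first one with a hit
    for length in range(1, len(text) + 1):
        counts = Counter(
            text[i:i + length] for i in range(len(text) - length + 1))
        unique = [sub for sub, count in counts.items() if count == 1]
        if unique:
            return unique
    return []


print(shortest_unique_by_length('aaggcgccttt'))
# ['aa', 'ag', 'gg', 'cg', 'cc', 'ct']
```

Each length still costs a full pass over the string, so the worst case remains quadratic or worse in the input size; the counter only removes the cost of the repeated `contains` calls.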

I have updated the question for clarity. I am not looking for the shortest runs, but for the unique substrings of the minimum possible length. Kindly look at the updated example. — vinay_raj, Jul 27 '19 at 15:15
Wouldn't the shortest unique sub-string always be just the unique individual letters found in the original string? In your updated example "a", "c", "g" and "t" are all valid sub-strings of the original string and are shorter than the solution. — Quetzalcoatl, Jul 27 '19 at 15:23
But all of them appear more than once in the string. We are looking for the unique substrings, that is, those which appear exactly once. — vinay_raj, Jul 27 '19 at 15:25
That is not really giving much of an answer now. *How* do you find the unique substrings? — trincot, Jul 27 '19 at 15:40
How do I find the non-repeated sub-strings? And, conversely, do I really need to find all the non-repeated sub-strings in order to find the shortest of them all? — vinay_raj, Jul 27 '19 at 15:45
Thanks for the solution. The time complexity of the solution is O(n^2). Can you provide a solution with O(n) time complexity? The O(n^2) solution is quite trivial, to be honest. I know that O(n) can be achieved using suffix arrays and LCP arrays, but I don't know the full approach to the solution. — vinay_raj, Jul 27 '19 at 17:04
This is not *O(n²)* but *O(n³)*, as `contains` is *O(n)*, and then we are still ignoring the pair-wise string comparison it does, which adds a factor too. — trincot, Jul 27 '19 at 18:05
@trincot That is not O(n³) because `AlreadySeen` is not of size `n`. In any case, that's just bad use of data types, as hashing would be O(1) (sort of) in the *contains* operation. — norok2, Jul 27 '19 at 19:06
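
The suffix-array idea mentioned in the comments above could look roughly like the following Python sketch. It rests on one observation: the shortest unique sub-string starting at a given position is one character longer than the longest prefix that its suffix shares with either neighbour in sorted suffix order. The function name is hypothetical, overlapping occurrences count as repeats, and the whole string is treated as a valid candidate. The suffix array is built naively by sorting slices to keep the sketch short, so this is not O(n) as written; a linear-time construction would have to be swapped in for that bound.

```python
def shortest_unique_substrings_sa(text):
    n = len(text)
    if n == 0:
        return set()
    # naive suffix array (assumption: brevity over speed)
    sa = sorted(range(n), key=lambda i: text[i:])
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    # Kasai's algorithm: lcp[r] is the common prefix length of the
    # suffixes at sorted positions r - 1 and r (lcp[0] stays 0)
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    best_len, best = None, set()
    for r, start in enumerate(sa):
        right = lcp[r + 1] if r + 1 < n else 0
        length = max(lcp[r], right) + 1
        if start + length > n:
            continue  # every prefix of this suffix also occurs elsewhere
        if best_len is None or length < best_len:
            best_len, best = length, {text[start:start + length]}
        elif length == best_len:
            best.add(text[start:start + length])
    return best


print(shortest_unique_substrings_sa('aaggcgccttt'))
```

For the example from the question this yields the six sub-strings of length 2 (as a set, so in no particular order).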

norok2 · Answer 2 · 2019-07-27T19:07:36.237

I'll write some code in Python, because that's what I find the easiest. I actually wrote both the overlapping and the non-overlapping variants. As a bonus, it also checks that the input is valid. You seem to be interested only in the overlapping variant:

import itertools


def find_all(
        text,
        pattern,
        overlap=False):
    """
    Find all occurrences of the pattern in the text.

    Args:
        text (str|bytes|bytearray): The input text.
        pattern (str|bytes|bytearray): The pattern to find.
        overlap (bool): Detect overlapping patterns.

    Yields:
        position (int): The position of the next finding.
    """
    len_text = len(text)
    offset = 1 if overlap else (len(pattern) or 1)
    i = 0
    while i < len_text:
        i = text.find(pattern, i)
        if i >= 0:
            yield i
            i += offset
        else:
            break


def is_valid(text, tokens):
    """
    Check if the text only contains the specified tokens.

    Args:
        text (str|bytes|bytearray): The input text.
        tokens (str|bytes|bytearray): The valid tokens for the text.

    Returns:
        result (bool): The result of the check.
    """
    return set(text).issubset(set(tokens))


def shortest_unique_substr(
        text,
        tokens='acgt',
        overlapping=True,
        check_valid_input=True):
    """
    Find the shortest unique substring.

    Args:
        text (str|bytes|bytearray): The input text.
        tokens (str|bytes|bytearray): The valid tokens for the text.
        overlapping (bool): Detect overlapping patterns.
        check_valid_input (bool): Check if the input is valid.

    Returns:
        result (set): The set of the shortest unique substrings.
    """
    def add_if_single_match(
            text,
            pattern,
            result,
            overlapping):
        match_gen = find_all(text, pattern, overlapping)
        try:
            next(match_gen)  # first match
        except StopIteration:
            # the pattern is not found, nothing to do
            pass
        else:
            try:
                next(match_gen)
            except StopIteration:
                # the pattern was found only once so add to results
                result.add(pattern)
            else:
                # the pattern is found twice, nothing to do
                pass

    # just some sanity check
    if check_valid_input and not is_valid(text, tokens):
        raise ValueError('Input text contains invalid tokens.')

    result = set()
    # Patterns longer than len(tokens) are not tried. Assumption: only proper
    # sub-strings are candidates, so a text like 'aaa' (with overlaps) gives
    # an empty set.
    max_lim = min(len(tokens), len(text) - 1)
    if overlapping:
        for n in range(1, max_lim + 1):
            # try every possible pattern of length n over the alphabet
            for pattern_gen in itertools.product(tokens, repeat=n):
                pattern = ''.join(pattern_gen)
                add_if_single_match(text, pattern, result, overlapping)
            if len(result) > 0:
                break
    else:
        for n in range(1, max_lim + 1):
            # + 1 so that the last sub-string of length n is tested as well
            for i in range(len(text) - n + 1):
                pattern = text[i:i + n]
                add_if_single_match(text, pattern, result, overlapping)
            if len(result) > 0:
                break
    return result

After some sanity check for the correctness of the outputs:

import functools

shortest_unique_substr_ovl = functools.partial(shortest_unique_substr, overlapping=True)
shortest_unique_substr_ovl.__name__ = 'shortest_unique_substr_ovl'

shortest_unique_substr_not = functools.partial(shortest_unique_substr, overlapping=False)
shortest_unique_substr_not.__name__ = 'shortest_unique_substr_not'

funcs = shortest_unique_substr_ovl, shortest_unique_substr_not

test_inputs = (
    'aaa',
    'aaaa',
    'aaggcgccttt',
    'agggcttttaaaatttaatttgggccc',
)

for func in funcs:
    print('Func:', func.__name__)
    for test_input in test_inputs:
        print(func(test_input))
    print()

Func: shortest_unique_substr_ovl
set()
set()
{'cg', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct'}

Func: shortest_unique_substr_not
{'aa'}
{'aaa'}
{'cg', 'tt', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct', 'cc'}

it is wise to benchmark how fast we actually are.

Below you can find some benchmarks, produced using some template code from here (the overlapping variant is in blue):

and the rest of the code for completeness:

import random


def gen_input(n, tokens='acgt'):
    return ''.join([tokens[random.randint(0, len(tokens) - 1)] for _ in range(n)])


def equal_output(a, b):
    return a == b


input_sizes = tuple(2 ** (1 + i) for i in range(16))

# benchmark() and plot_benchmarks() come from the template code mentioned above
runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)

plot_benchmarks(runtimes, input_sizes, labels, units='ms')
plot_benchmarks(runtimes, input_sizes, labels, units='μs', zoom_fastest=2)

As far as the asymptotic time-complexity analysis is concerned, considering only the overlapping case: let N be the input size and K the number of tokens (4 in your case). find_all() is O(N) for each pattern, and the number of patterns tried in the body of shortest_unique_substr depends only on K, not on N. Since K is fixed, the overall cost is O(N), as the benchmarks seem to indicate.

If you want to get fancy, you may want to consider using a variant of the [Aho-Corasick Algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm). — norok2, Jul 27 '19 at 19:36
