Can suffix trees or suffix arrays be used effectively with numbers?

For example:

Can they be used with the array [1,2,3,4,5,3,9,8,5,3,9,8,6,4,5,3,9,11,9,8,7,11] to extract all non-overlapping repeating sub-strings of every length from the array's contents? If so, could you provide an implementation? I have been trying to do this but have not found an effective solution.

Expected results:

4,5
4,5,3
4,5,3,9
5,3
5,3,9
5,3,9,8
...

Considering the array [1,2,3,4,5,9,3,4,5,9,3,3,4,5,9,3]: non-overlapping means that the extracted group 3,4,5,9,3 is derived from the repetitions at indexes 2 to 6 and 11 to 15, and NOT from the one at indexes 6 to 10.
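
To make the non-overlapping rule concrete, here is a minimal sketch that scans left to right and skips any occurrence overlapping the one accepted just before it. The class and method names (`NonOverlappingOccurrences`, `findNonOverlapping`) are hypothetical and only illustrate the rule; this is not an efficient search.

```java
import java.util.ArrayList;
import java.util.List;

public class NonOverlappingOccurrences {

    // Returns the start indexes of pattern in arr, scanning left to right and
    // skipping any occurrence that overlaps the previously accepted one.
    public static List<Integer> findNonOverlapping(int[] arr, int[] pattern) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.length == 0) {
            return starts; // nothing to search for
        }
        int i = 0;
        while (i + pattern.length <= arr.length) {
            boolean match = true;
            for (int k = 0; k < pattern.length; k++) {
                if (arr[i + k] != pattern[k]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                starts.add(i);
                i += pattern.length; // jump past this occurrence
            } else {
                i++;
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        int[] arr = {1, 2, 3, 4, 5, 9, 3, 4, 5, 9, 3, 3, 4, 5, 9, 3};
        int[] pattern = {3, 4, 5, 9, 3};
        System.out.println(findNonOverlapping(arr, pattern)); // prints [2, 11]
    }
}
```

The occurrence starting at index 6 is skipped because it shares index 6 with the first one, which leaves the occurrences at indexes 2 to 6 and 11 to 15.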

  • can you provide an example of what you're expecting? – RPresle Jan 21 '16 at 14:26
  • Considering the array that I mentioned in the question, I need to extract all repeating sub-strings from the array's contents; like 5,3,9 is repeated, so is 5,3,9,8 – hackerspark Jan 21 '16 at 14:28
  • If I am not mistaken, then the B-Tree mechanism will churn out numbers with multiple occurrences in the array, but what I seek to achieve is the extraction of all repeated sequences in the array and not occurrences of numbers in the array, just as 5,3,9 is a repeating sequence in the array, so is 5,3,9,8 – hackerspark Jan 21 '16 at 14:34
  • @slartidan Yes, that's really helpful. – hackerspark Jan 21 '16 at 16:03

1 Answer

Here it is

import java.util.Set;
import java.util.TreeSet;

public class RepeatedSequences {
    public static void main(String[] args) {
        int[] arr = {1, 2, 3, 4, 5, 3, 9, 8, 5, 3, 9, 8, 6, 4, 5, 3, 9, 11, 9, 8, 7, 11};
        Set<String> strings = new TreeSet<>(); // sorted, so the output order is stable
        // for every position in the array:
        for (int startPos = 0; startPos < arr.length; startPos++) {
            // compare with every later position
            for (int startComp = startPos + 1; startComp < arr.length; startComp++) {
                int len = 0; // length of the matched sequence so far
                StringBuilder seq = new StringBuilder();
                // extend while the second occurrence stays inside the array, the first
                // occurrence does not run into the second, and the elements still match
                while (startComp + len < arr.length
                        && startPos + len < startComp
                        && arr[startPos + len] == arr[startComp + len]) {
                    if (len > 0) {
                        seq.append(','); // separator, so 9,11 cannot be confused with 91,1
                    }
                    seq.append(arr[startPos + len]);
                    // keep only sequences of length 2 or more
                    if (len > 0) {
                        strings.add(seq.toString());
                    }
                    len++;
                }
            }
        }
        System.out.println(String.join(" ", strings));
    }
}

For your data, the results are:

3,9 3,9,8 4,5 4,5,3 4,5,3,9 5,3 5,3,9 5,3,9,8 9,8

Basically, loop through the array. For each element, compare it with every element to its right. When the two are equal, keep comparing the following elements to grow the sequence, and add it to the set once its length is greater than 1.

Hope it helps

RPresle
  • Thanks, this covers the case of smaller arrays. The reason I was trying the suffix tree approach was to keep the complexity low for large arrays, probably like one with more than tens of thousands of numbers in it. Unfortunately, I seem to be falling short on the logic to achieve the same. – hackerspark Jan 21 '16 at 16:47
  • Oh and the code does not extract all possible repeated sub-strings/sub-sequences of numbers in the array. – hackerspark Jan 21 '16 at 16:48
  • I don't know about the suffix algorithm. I hope someone else can help you on this. Anyway, can you provide example of sub-string missing so I can improve the little piece of code? – RPresle Jan 21 '16 at 21:13
  • I think it only misses the repeated sub-strings/sequences of length 1 – hackerspark Jan 22 '16 at 03:54
  • I thought you didn't want them. In order to have them just remove the test if(len>0). It will work =) – RPresle Jan 22 '16 at 08:37
  • Yes, that's exactly what I modified in the code to achieve the same. Thanks. I am hoping for one that does with a lesser time complexity, currently working on the same using suffix arrays. – hackerspark Jan 22 '16 at 08:41
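
For the suffix-array direction mentioned in the comments, the following is a rough sketch rather than a tuned implementation. It assumes a naive construction (sorting suffix start positions with a comparator, which is roughly O(n² log n) in the worst case) and a directly computed LCP array. For every length, it groups adjacent suffixes that share a prefix of at least that length and keeps the prefix only if the smallest and largest start positions in the group are at least that far apart, which guarantees two non-overlapping occurrences. The names `SuffixArrayRepeats`, `repeats` and `minLen` are hypothetical. For arrays with tens of thousands of elements, the naive sort would need to be replaced by a faster construction (for example prefix doubling or SA-IS) and the LCP computation by Kasai's algorithm.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class SuffixArrayRepeats {

    // Compares the suffixes starting at a and b element by element.
    static int compareSuffixes(int[] arr, int a, int b) {
        while (a < arr.length && b < arr.length) {
            if (arr[a] != arr[b]) {
                return Integer.compare(arr[a], arr[b]);
            }
            a++;
            b++;
        }
        return Integer.compare(b, a); // the suffix that ran out first sorts first
    }

    // Naive suffix array: sort all suffix start positions.
    static Integer[] buildSuffixArray(int[] arr) {
        Integer[] sa = new Integer[arr.length];
        for (int i = 0; i < arr.length; i++) {
            sa[i] = i;
        }
        Arrays.sort(sa, (a, b) -> compareSuffixes(arr, a, b));
        return sa;
    }

    // Length of the common prefix of the suffixes starting at a and b.
    static int commonPrefix(int[] arr, int a, int b) {
        int len = 0;
        while (a + len < arr.length && b + len < arr.length
                && arr[a + len] == arr[b + len]) {
            len++;
        }
        return len;
    }

    // Formats arr[start .. start + len) as a comma-separated string.
    static String join(int[] arr, int start, int len) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < len; k++) {
            if (k > 0) {
                sb.append(',');
            }
            sb.append(arr[start + k]);
        }
        return sb.toString();
    }

    // All sequences of length >= minLen that occur at least twice without overlapping.
    public static Set<String> repeats(int[] arr, int minLen) {
        int n = arr.length;
        Integer[] sa = buildSuffixArray(arr);
        int[] lcp = new int[n]; // lcp[i]: common prefix of suffixes sa[i - 1] and sa[i]
        int maxLcp = 0;
        for (int i = 1; i < n; i++) {
            lcp[i] = commonPrefix(arr, sa[i - 1], sa[i]);
            maxLcp = Math.max(maxLcp, lcp[i]);
        }
        Set<String> result = new TreeSet<>();
        for (int len = Math.max(1, minLen); len <= maxLcp; len++) {
            int i = 1;
            while (i < n) {
                if (lcp[i] < len) {
                    i++;
                    continue;
                }
                // sa[i - 1] opens a run of suffixes sharing a prefix of length len
                int min = sa[i - 1];
                int max = sa[i - 1];
                while (i < n && lcp[i] >= len) {
                    min = Math.min(min, sa[i]);
                    max = Math.max(max, sa[i]);
                    i++;
                }
                if (max - min >= len) { // first and last occurrence do not overlap
                    result.add(join(arr, min, len));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] arr = {1, 2, 3, 4, 5, 3, 9, 8, 5, 3, 9, 8, 6, 4, 5, 3, 9, 11, 9, 8, 7, 11};
        System.out.println(String.join(" ", repeats(arr, 2)));
        // prints 3,9 3,9,8 4,5 4,5,3 4,5,3,9 5,3 5,3,9 5,3,9,8 9,8
    }
}
```

Calling `repeats(arr, 1)` would also include the repeated single elements discussed above. The outer loop over lengths keeps the sketch short, but it costs O(n) per length on top of the construction.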