2

I need to find all repeating word patterns inside a sentence.

Example: 'camel horse game camel horse gym camel horse game' # This is the sanitized string; I strip everything other than words beforehand.

['camel horse game', 0, 6] # pattern and the word indexes where it occurs
['camel horse', 0, 3, 6] # another pattern, even though it is a substring of the previous one

A suffix tree seems like a good solution, but I don't understand how to implement it for WORDS instead of letters/characters.

A standard duplicate-substrings solution will not work, because it also finds patterns made of chopped/half words ('camel horse', 'amel hor', ... 'am h'), which are of no practical use.

Thanks in advance.

Nishutosh Sharma

3 Answers

2

You can build a suffix tree for any alphabet that you'd like. Imagine that you create an alphabet where each distinct word in the paragraph is treated as a single letter. Then, the suffix tree will let you find repeating sequences of words in the paragraph without breaking apart the words into individual characters.
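
To make the "one word is one letter" idea concrete, here is a minimal sketch. It is not a suffix tree: it is a brute-force stand-in that indexes every word n-gram, which is enough to show that patterns never split a word. The method name `repeated_word_patterns` and the `min_words` parameter are made up for this illustration.

```ruby
# Sketch only: a brute-force word n-gram index, not a suffix tree.
# Each word is treated as one indivisible symbol.
def repeated_word_patterns(sentence, min_words: 2)
  words = sentence.split
  patterns = Hash.new { |hash, key| hash[key] = [] }

  # Record the starting word index of every n-gram of at least min_words words.
  (min_words..words.length).each do |n|
    words.each_cons(n).with_index do |gram, start|
      patterns[gram.join(' ')] << start
    end
  end

  # Keep only the n-grams that occur at more than one position.
  patterns.select { |_pattern, starts| starts.length > 1 }
end

text = 'camel horse game camel horse gym camel horse game'
p repeated_word_patterns(text)
```

For the example sentence this should report 'camel horse' at word indexes 0, 3 and 6, 'horse game' at 1 and 7, and 'camel horse game' at 0 and 6. The sketch takes quadratic time and memory in the number of words, which is the cost a real suffix tree over the same word alphabet would avoid.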

templatetypedef
  • It would be great if you could illustrate that with an example (in any language) or some pseudocode to support the answer. – Nishutosh Sharma Oct 23 '16 at 19:11
  • One doubt: what if I have more than 26 distinct words? I would then have to use combinations of letters, which would not be a sustainable or scalable solution. – Nishutosh Sharma Oct 23 '16 at 19:13
  • There are a number of algorithms (Farach's algorithm is the first and one of the easier ones to understand) designed to build suffix trees in the case where the string consists of integer values. You can assign a numeric value to each word and then build the suffix tree out of those numbers. This is a tricky algorithm to code up yourself - as are any algorithms for building suffix trees - but if you want to go that route this would probably be the most elegant way to do it. – templatetypedef Oct 23 '16 at 19:33
  • @NishutoshSharma The number of elements is irrelevant. There is no need for a mapping between words and individual letters. A good suffix tree implementation will let you work with custom types as alphabet characters. – Rerito Oct 25 '16 at 09:57
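
The comments above suggest giving each word a numeric value, so the alphabet is not limited to 26 letters. A rough sketch of that, assuming a naive sorted suffix array is acceptable in place of Farach's algorithm or a real suffix tree. All method names here are hypothetical.

```ruby
# Sketch only: a naive suffix array over integer word IDs, not Farach's algorithm.

# Give each distinct word an integer ID in order of first appearance,
# so the "alphabet" can be as large as the vocabulary.
def encode_words(words)
  ids = {}
  words.map { |word| ids[word] ||= ids.size }
end

# Sort suffix start positions by comparing the integer suffixes element-wise.
def word_suffix_array(codes)
  (0...codes.length).sort_by { |start| codes[start..-1] }
end

# Number of leading symbols shared by the suffixes starting at i and j.
def shared_prefix_length(codes, i, j)
  length = 0
  length += 1 while codes[i + length] && codes[i + length] == codes[j + length]
  length
end

# Longest word sequence that starts at two or more positions.
def longest_repeated_words(sentence)
  words = sentence.split
  codes = encode_words(words)
  best_start = 0
  best_length = 0
  # Suffixes that share a long prefix end up next to each other after sorting.
  word_suffix_array(codes).each_cons(2) do |a, b|
    length = shared_prefix_length(codes, a, b)
    if length > best_length
      best_start = a
      best_length = length
    end
  end
  words[best_start, best_length].join(' ')
end

puts longest_repeated_words('camel horse game camel horse gym camel horse game')
```

Because the comparison works on integers, a vocabulary of thousands of distinct words behaves the same as one with four. Sorting whole suffixes like this is slow for long inputs, so it only illustrates the encoding step.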
0

I found this implementation in Ruby: http://rubyquiz.com/quiz153.html

It can be modified to find all recurring substrings. It includes a custom suffix tree implementation.

  • Can you include the relevant parts of the linked article here in the answer? Generally, link-only answers are discouraged because they tend to go stale over time. – templatetypedef Oct 26 '16 at 16:05
0
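# Returns a hash mapping each repeated substring (as a symbol) to the offsets where it was found.
# Defaults to the global $string that the original code reads.
def all_repeated_substrings(string = $string)
  patterns = {}
  size = string.length

  # Build every suffix of the string.
  suffixes = Array.new(size)
  size.times do |i|
    suffixes[i] = string.slice(i, size)
  end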

  suffixes.sort!

  recurrence = ''
  at_least_size = 2 # the size to meet or exceed to be the new recurrence
  distance = nil
  neighbors_to_check = 1

  (1...size).each do |i|
    s1 = suffixes[i]
    neighbors_to_check.downto(1) do |neighbor|
      s2 = suffixes[i - neighbor]
      s1_size = s1.size
      s2_size = s2.size
      distance = (s1_size - s2_size).abs
      next if distance < at_least_size
      recurrence = longest_common_prefix(s1, s2, distance)
      if recurrence.size > 1
        if patterns[:"#{recurrence}"]
          patterns[:"#{recurrence}"] << (size - s2_size)
        else
          patterns[:"#{recurrence}"] = [(size - s2_size), (size - s1_size)]
        end
      end
      at_least_size = recurrence.size + 1
      if recurrence.size == distance
        neighbors_to_check = [neighbors_to_check, neighbor + 1].max
      else
        neighbors_to_check = neighbor
      end
    end
  end
  return patterns
end
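
The method above calls `longest_common_prefix`, which is not shown here. A minimal sketch of such a helper, assuming it takes two strings plus an optional cap on the prefix length and returns the shared prefix as a string. The actual helper in the linked quiz solution may differ.

```ruby
# Assumed helper: the shared prefix of s1 and s2, at most max characters long.
def longest_common_prefix(s1, s2, max = nil)
  limit = [s1.size, s2.size].min
  limit = [limit, max].min if max

  # Advance while both strings agree and the cap has not been reached.
  length = 0
  length += 1 while length < limit && s1[length] == s2[length]
  s1[0, length]
end

puts longest_common_prefix('camel horse', 'camel house')
puts longest_common_prefix('camel horse', 'camel house', 3)
```

Capping the prefix at `distance`, as the caller does, keeps the two occurrences from overlapping.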

This is, I guess, an improved version of the http://rubyquiz.com/quiz153.html solution. There is one issue, though: it will not work for cyclic patterns such as 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'. Anyone is welcome to improve the code above so that it handles cyclic patterns as well.

Nishutosh Sharma