Find all repeating non-overlapping substrings and cycles

Question

I have a complex string-manipulation problem at hand. My string contains cycles as well as recurrences, and I need to identify and list them.

'abcabcabcabcabcdkkabclilabcoabcdieabcdowabcdppabzabx'

The following are the possible patterns ->

(The indexes below are illustrative, not the actual ones.)

abc -> 0,3,6,9,12,15,17, ..... (occurrence index for recurring string), 0,3,6,9 (unique occurrence index for recurring string; 12, 15, 17 are disqualified because there abc is part of a longer repeating substring)

abcd -> 12, 15, 17 (occurrence index for recurring string), 12, 15, 17 (unique occurrence index for recurring string)

bcda -> 13, 16, 18.. (occurrence index for recurring string), no unique occurrence index for recurring string, as it overlaps the string abcd. Hence it is not required.

ab -> 0,3,6,9,12,15,17, 25, 27 ... (occurrence index for recurring string), 25, 27 (unique occurrence index for recurring string). .....

I want to find all unique recurring occurrences, i.e. all unique, non-overlapping values of the recurring strings, as described above. The input string may contain:

All cyclic patterns (abcabcabcdefdefdeflkjlkjlkj => abc, def, lkj are recurrences in a cycle, but bc, ab, bcab are not expected, as they are false positives), or

Separately recurring patterns (abcxabcdabcm => abc is a recurrence but not a cycle, i.e. the occurrences are not adjacent), or

A mix of both (abcabcabcabcabclkabcdokabcdhuabcd => abc is a cyclic recurrence and abcd is a non-cyclic recurrence, and we need to find both -> only abcd and abc are recurring, not bc, ab, bcda, etc.)

Can someone propose an algorithm for this problem? I am trying suffix arrays, but that approach is not finding the overlapping results either.

What does "all" or "occurrence" mean? What's the right answer for "aaaaaaa"? I could argue that that's two cycles of "aaa" with something left over. I don't think the problem is at all well defined. — matt, Oct 28 '16 at 20:41
@matt Updated question, we are actually looking for recurrences, be it from cycles or separate occurrences (cycle -> 'abcabcabc', separate -> 'abcokabcdeabcll', mix -> 'abcabcabcabcabklasdoqwepzxcpasdpabc'). Answer to 'aaaaaa' -> 'a', as it becomes a cycle of 'a'. I have already updated the problem statement, please look at it. — Nishutosh Sharma, Oct 28 '16 at 20:57
(Try cluing in language detection with `` before a code block if there is no (programming) language to detect: you can get rid of "decorated" literals and `for`. (A block quote may be more suited, anyway: it gets wrapped - you have some _very long_ lines.)) (I, for one, do find giving indices that don't correspond to any example irritating. Indices that differ by 3 can't possibly correspond to recurrences of a pattern of 4 different characters…) — greybeard, Oct 28 '16 at 22:22
@greybeard I think I already mentioned the indices are not real. But they at least give an idea of what is happening. — Nishutosh Sharma, Oct 29 '16 at 17:50

Cary Swoveland · Answer 1 · 2016-10-29T21:48:46.200

A hash is constructed whose keys consist of all unique substrings of a given string that appear at least twice in the string (not overlapping) and, for each key, the value is an array of all offsets into the string where the value of the key (a substring) begins.

Code

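def recurring_substrings(str)
  arr = str.chars
  # A substring longer than half the string cannot occur twice without overlapping.
  (1..str.size/2).each_with_object({}) do |n,h|
    # Every distinct substring of length n.
    arr.each_cons(n).map { |b| b.join }.uniq.each do |s|
      # Escape s so regex metacharacters in the input are matched literally;
      # scan yields non-overlapping matches, whose start offsets are recorded.
      str.scan(Regexp.new(Regexp.escape(s))) { (h[s] ||= []) << Regexp.last_match.begin(0) }
    end
  end.reject { |_,v| v.size == 1 } # keep only substrings found at least twice
end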

Examples

recurring_substrings 'abjkabrjkab'
  #=> {"a"=>[0, 4, 9], "b"=>[1, 5, 10], "j"=>[2, 7], "k"=>[3, 8], "ab"=>[0, 4, 9],
  #    "jk"=>[2, 7], "ka"=>[3, 8], "jka"=>[2, 7], "kab"=>[3, 8], "jkab"=>[2, 7]}

recurring_substrings "abcabcabcabcabcdkkabclilabcoabcdieabcdowabcdppabzabx"
  #=> {"a"=>[0, 3, 6, 9, 12, 18, 24, 28, 34, 40, 46, 49],
  #    "b"=>[1, 4, 7, 10, 13, 19, 25, 29, 35, 41, 47, 50],
  #    "c"=>[2, 5, 8, 11, 14, 20, 26, 30, 36, 42], "d"=>[15, 31, 37, 43],
  #    "k"=>[16, 17], "l"=>[21, 23], "i"=>[22, 32], "o"=>[27, 38], "p"=>[44, 45],
  #    "ab"=>[0, 3, 6, 9, 12, 18, 24, 28, 34, 40, 46, 49],
  #    "bc"=>[1, 4, 7, 10, 13, 19, 25, 29, 35, 41], "ca"=>[2, 5, 8, 11],
  #    "cd"=>[14, 30, 36, 42],
  #    "abc"=>[0, 3, 6, 9, 12, 18, 24, 28, 34, 40], "bca"=>[1, 4, 7, 10],
  #    "cab"=>[2, 5, 8, 11], "bcd"=>[13, 29, 35, 41],
  #    "abca"=>[0, 6], "bcab"=>[1, 7], "cabc"=>[2, 8], "abcd"=>[12, 28, 34, 40],
  #    "abcab"=>[0, 6], "bcabc"=>[1, 7], "cabca"=>[2, 8],
  #    "abcabc"=>[0, 6], "bcabca"=>[1, 7], "cabcab"=>[2, 8]}
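
As a further example, here is a sketch of what the method reports for the string 'aaaaaaa' raised in the comments. The method body is copied from the code above, with Regexp.escape added as a precaution (an assumption on my part). Because String#scan resumes after the end of each match, the offsets for a given key never overlap; note that the result lists "aa" and "aaa" as well, not only the single 'a' the asker expects.

```ruby
# Copy of the answer's method so this block runs on its own
# (Regexp.escape is a precaution added for this sketch).
def recurring_substrings(str)
  arr = str.chars
  (1..str.size/2).each_with_object({}) do |n,h|
    arr.each_cons(n).map { |b| b.join }.uniq.each do |s|
      str.scan(Regexp.new(Regexp.escape(s))) { (h[s] ||= []) << Regexp.last_match.begin(0) }
    end
  end.reject { |_,v| v.size == 1 }
end

recurring_substrings 'aaaaaaa'
  #=> {"a"=>[0, 1, 2, 3, 4, 5, 6], "aa"=>[0, 2, 4], "aaa"=>[0, 3]}
```

The offsets of "aa" are 0, 2, 4 rather than 0, 1, 2, ..., since each match consumes two characters before the scan continues.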

Explanation

For the first example above, the steps are as follows.

str = 'abjkabrjkab'
arr = str.chars
  #=> ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"] 
q = str.size/2 # max size for string to repeat at least once
  #=> 5 
b = (1..q).each_with_object({})
  #=> #<Enumerator: 1..5:each_with_object({})>

We can see which elements will be generated by this enumerator by converting it to an array. (I will do this a few more times below.)

b.to_a
  #=> [[1, {}], [2, {}], [3, {}], [4, {}], [5, {}]]

The empty hashes will be built up as calculations progress.

Next pass the first element to the block and set the block variables to it using parallel assignment (sometimes called multiple assignment).

n,h = b.next
  #=> [1, {}] 
n #=> 1 
h #=> {} 

c = arr.each_cons(n)
  #=> #<Enumerator: ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"]:each_cons(1)>

c is an enumerator that generates every run of n consecutive characters (here n = 1), each as an array of characters. At the next iteration it will generate all runs of length 2, and so on. See Enumerable#each_cons.

c.to_a # Let's see which elements will be generated.
  #=> [["a"], ["b"], ["j"], ["k"], ["a"], ["b"], ["r"], ["j"], ["k"], ["a"], ["b"]] 
d = c.map { |b| b.join }
  #=> ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"] 
e = d.uniq
  #=> ["a", "b", "j", "k", "r"]

At the next iteration this will be

r = arr.each_cons(2)
  #=> #<Enumerator: ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"]:
  #    each_cons(2)>
r.to_a
  #=> [["a", "b"], ["b", "j"], ["j", "k"], ["k", "a"], ["a", "b"],
  #    ["b", "r"], ["r", "j"], ["j", "k"], ["k", "a"], ["a", "b"]]  
s = r.map { |b| b.join }
  #=> ["ab", "bj", "jk", "ka", "ab", "br", "rj", "jk", "ka", "ab"] 
s.uniq
  #=> ["ab", "bj", "jk", "ka", "br", "rj"]

Continuing,

f = e.each
  #=> #<Enumerator: ["a", "b", "j", "k", "r"]:each> 
f.to_a # Let's see which elements will be generated.
  #=> ["a", "b", "j", "k", "r"] 

s = f.next
  #=> "a" 
r = (Regexp.new(s))
  #=> /a/ 
str.scan(r) { (h[s] ||= []) << Regexp.last_match.begin(0) }

If h does not yet have a key s, h[s] #=> nil. h[s] ||= [] is effectively h[s] || (h[s] = []), so when h[s] is nil it sets h[s] to an empty array before h[s] << Regexp.last_match.begin(0) is executed. That is, h[s] || (h[s] = []) #=> nil || (h[s] = []) #=> [].
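
The grouping idiom can be tried in isolation. The following is only a sketch: the method names and the sample pairs are made up for illustration. The second method shows a common alternative, a Hash with a default block, which is not what the answer uses.

```ruby
# Group offsets by key using ||=, as the answer does.
def group_with_or_equals(pairs)
  h = {}
  pairs.each { |key, offset| (h[key] ||= []) << offset }
  h
end

# Alternative (not used in the answer): let the hash create
# the empty array itself the first time a key is looked up.
def group_with_default_block(pairs)
  h = Hash.new { |hash, key| hash[key] = [] }
  pairs.each { |key, offset| h[key] << offset }
  h
end

pairs = [['a', 0], ['b', 1], ['a', 4]]
group_with_or_equals(pairs)
  #=> {"a"=>[0, 4], "b"=>[1]}
group_with_default_block(pairs)
  #=> {"a"=>[0, 4], "b"=>[1]}
```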

Within the block the MatchData object is retrieved with the class method Regexp::last_match. (Alternatively, one could substitute the global variable $~ for Regexp.last_match. For details, search for "special global variables" at Regexp.) MatchData#begin returns the index of str at which the current match begins.
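
A small sketch of that equivalence, with hypothetical helper names that are not part of the answer: both methods collect the same offsets, one through Regexp.last_match and one through $~. The last lines show that MatchData#begin also accepts a capture-group number.

```ruby
# Offsets collected with Regexp.last_match, as in the answer.
def offsets_with_last_match(str, re)
  offsets = []
  str.scan(re) { offsets << Regexp.last_match.begin(0) }
  offsets
end

# The same, using the special global variable $~ instead.
def offsets_with_global(str, re)
  offsets = []
  str.scan(re) { offsets << $~.begin(0) }
  offsets
end

offsets_with_last_match('abjkabrjkab', /jk/)  #=> [2, 7]
offsets_with_global('abjkabrjkab', /jk/)      #=> [2, 7]

# begin(0) is the start of the whole match; begin(n) for n > 0
# is the start of the n-th capture group.
m = 'abjkabrjkab'.match(/j(k)/)
m.begin(0)  #=> 2
m.begin(1)  #=> 3
```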

Now

h #=> {"a"=>[0, 4, 9]}

The remaining calculations are similar, adding key-value pairs to h. Once all substring lengths have been processed, reject drops every key with only a single offset, leaving the hash given in the example.
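
One caveat when a pattern is built from an arbitrary substring: characters such as . or + are regex metacharacters, so the substring should be escaped (for example with Regexp.escape) if the input may contain them. A minimal sketch, where match_offsets and the sample string are made up for illustration:

```ruby
# Hypothetical helper: start offsets of all non-overlapping matches.
def match_offsets(str, pattern)
  offsets = []
  str.scan(pattern) { offsets << Regexp.last_match.begin(0) }
  offsets
end

str = 'a.bxaxb a.b'

# Unescaped, the dot matches any character, so "axb" is counted too.
match_offsets(str, Regexp.new('a.b'))                 #=> [0, 4, 8]

# Escaped, only the literal substring "a.b" is matched.
match_offsets(str, Regexp.new(Regexp.escape('a.b')))  #=> [0, 8]
```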

I think, for `abcabcabc` -> `abc`, 0,3,6 was expected, it gives -> abc, ab, bc with their respective indexes instead. — Nishutosh Sharma, Oct 28 '16 at 20:36
Thanks, @EricDuminil. I wasn't aware of `MatchData#begin`. I've modified my answer to incorporate your suggestion. — Cary Swoveland, Oct 29 '16 at 21:49

score 1 · Answer 2 · answered Oct 28 '16 at 22:04

For further processing after @CarySwoveland's excellent answer:

def ignore_smaller_substrings(hash)
  found_indices = []
  new_hash = {}
  hash.sort_by{|s,_| [-s.size,s]}.each{|s,indices|
    indices -= found_indices
    found_indices |= indices
    new_hash[s]=indices unless indices.empty?
  }
  new_hash
end

pp ignore_smaller_substrings(recurring_substrings('abcabcabcabcabcdkkabclilabcoabcdieabcdowabcdppabzabx'))

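The hash is sorted by decreasing substring length (and then alphabetically), and each index is allowed to appear only once: an offset already claimed by a longer substring is removed from every shorter one, and substrings left with no offsets are dropped.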

It outputs

{"abcabc"=>[0, 6],
 "bcabca"=>[1, 7],
 "cabcab"=>[2, 8],
 "abcd"=>[12, 28, 34, 40],
 "abc"=>[3, 9, 18, 24],
 "bca"=>[4, 10],
 "bcd"=>[13, 29, 35, 41],
 "cab"=>[5, 11],
 "ab"=>[46, 49],
 "bc"=>[19, 25],
 "cd"=>[14, 30, 36, 42],
 "b"=>[47, 50],
 "c"=>[20, 26],
 "d"=>[15, 31, 37, 43],
 "i"=>[22, 32],
 "k"=>[16, 17],
 "l"=>[21, 23],
 "o"=>[27, 38],
 "p"=>[44, 45]}

It doesn't answer the question exactly, but it comes a bit closer.
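
To illustrate the remaining gap, here is a sketch that applies the same filter to the hash recurring_substrings returns for 'abcabcabc' (written out as a literal so the block runs on its own). The filter compares start offsets only, so the shifted substrings bca and cab survive alongside abc, and leftovers with a single offset can remain.

```ruby
# Copy of the filter above so this block is self-contained.
def ignore_smaller_substrings(hash)
  found_indices = []
  new_hash = {}
  hash.sort_by { |s, _| [-s.size, s] }.each do |s, indices|
    indices -= found_indices   # drop offsets already claimed by longer substrings
    found_indices |= indices   # claim the remaining offsets
    new_hash[s] = indices unless indices.empty?
  end
  new_hash
end

# What recurring_substrings('abcabcabc') returns.
input = {
  "a"   => [0, 3, 6], "b"   => [1, 4, 7], "c"   => [2, 5, 8],
  "ab"  => [0, 3, 6], "bc"  => [1, 4, 7], "ca"  => [2, 5],
  "abc" => [0, 3, 6], "bca" => [1, 4],    "cab" => [2, 5]
}

ignore_smaller_substrings(input)
  #=> {"abc"=>[0, 3, 6], "bca"=>[1, 4], "cab"=>[2, 5], "bc"=>[7], "c"=>[8]}
```

The asker expected only abc at 0, 3, 6 for this input, which would require tracking the positions each match covers rather than only where it starts.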
