The method below builds a hash whose keys are the distinct substrings of a given string that occur at least twice without overlapping. For each key, the value is an array of the offsets into the string at which that substring begins.
Code
def recurring_substrings(str)
  arr = str.chars
  (1..str.size/2).each_with_object({}) do |n,h|
    arr.each_cons(n).map { |b| b.join }.uniq.each do |s|
      # Regexp.escape ensures s is matched literally, even if it contains
      # regex metacharacters such as "." or "*".
      str.scan(Regexp.new(Regexp.escape(s))) { (h[s] ||= []) << Regexp.last_match.begin(0) }
    end
  end.reject { |_,v| v.size == 1 }
end
Examples
recurring_substrings 'abjkabrjkab'
#=> {"a"=>[0, 4, 9], "b"=>[1, 5, 10], "j"=>[2, 7], "k"=>[3, 8], "ab"=>[0, 4, 9],
# "jk"=>[2, 7], "ka"=>[3, 8], "jka"=>[2, 7], "kab"=>[3, 8], "jkab"=>[2, 7]}
recurring_substrings "abcabcabcabcabcdkkabclilabcoabcdieabcdowabcdppabzabx"
#=> {"a"=>[0, 3, 6, 9, 12, 18, 24, 28, 34, 40, 46, 49],
# "b"=>[1, 4, 7, 10, 13, 19, 25, 29, 35, 41, 47, 50],
# "c"=>[2, 5, 8, 11, 14, 20, 26, 30, 36, 42], "d"=>[15, 31, 37, 43],
# "k"=>[16, 17], "l"=>[21, 23], "i"=>[22, 32], "o"=>[27, 38], "p"=>[44, 45],
# "ab"=>[0, 3, 6, 9, 12, 18, 24, 28, 34, 40, 46, 49],
# "bc"=>[1, 4, 7, 10, 13, 19, 25, 29, 35, 41], "ca"=>[2, 5, 8, 11],
# "cd"=>[14, 30, 36, 42],
# "abc"=>[0, 3, 6, 9, 12, 18, 24, 28, 34, 40], "bca"=>[1, 4, 7, 10],
# "cab"=>[2, 5, 8, 11], "bcd"=>[13, 29, 35, 41],
# "abca"=>[0, 6], "bcab"=>[1, 7], "cabc"=>[2, 8], "abcd"=>[12, 28, 34, 40],
# "abcab"=>[0, 6], "bcabc"=>[1, 7], "cabca"=>[2, 8],
# "abcabc"=>[0, 6], "bcabca"=>[1, 7], "cabcab"=>[2, 8]}
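A further small example may help to show what "not overlapping" means here. It is only a sketch: the method is repeated so that the snippet runs on its own (with Regexp.escape, so that each substring is matched literally), and the input 'aaaa' is my own choice rather than one taken from the examples above. The substring "aa" could start at offsets 0, 1 and 2, but String#scan resumes searching after the end of each match, so only offsets 0 and 2 are recorded.

```ruby
# Method repeated from the Code section so this snippet is self-contained.
def recurring_substrings(str)
  arr = str.chars
  (1..str.size/2).each_with_object({}) do |n,h|
    arr.each_cons(n).map { |b| b.join }.uniq.each do |s|
      str.scan(Regexp.new(Regexp.escape(s))) { (h[s] ||= []) << Regexp.last_match.begin(0) }
    end
  end.reject { |_,v| v.size == 1 }
end

overlap_demo = recurring_substrings('aaaa')
#=> {"a"=>[0, 1, 2, 3], "aa"=>[0, 2]}

# "ba" occurs only once in 'abab', so it is rejected.
abab_demo = recurring_substrings('abab')
#=> {"a"=>[0, 2], "b"=>[1, 3], "ab"=>[0, 2]}
```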
Explanation
For the first example above, the steps are as follows.
str = 'abjkabrjkab'
arr = str.chars
#=> ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"]
q = str.size/2 # max length of a substring that can occur twice without overlapping
#=> 5
b = (1..q).each_with_object({})
#=> #<Enumerator: 1..5:each_with_object({})>
We can see which elements will be generated by this enumerator by converting it to an array. (I will do this a few more times below.)
b.to_a
#=> [[1, {}], [2, {}], [3, {}], [4, {}], [5, {}]]
Every pair refers to the same (initially empty) hash, which will be built up as the calculations progress.
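As an aside, the behaviour of each_with_object can be seen in isolation with a small sketch. The names squares, enum and same_memo are made up for illustration; the point is only that a single hash object is passed to every iteration and is returned at the end.

```ruby
# With a block: h is the same hash on every iteration, and it is the return value.
squares = (1..3).each_with_object({}) { |n, h| h[n] = n * n }
#=> {1=>1, 2=>4, 3=>9}

# Without a block an enumerator is returned; each pair it generates holds a
# reference to one and the same hash object.
enum = (1..3).each_with_object({})
same_memo = enum.to_a.map { |_, h| h.object_id }.uniq.size == 1
#=> true
```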
Next pass the first element to the block and set the block variables to it using parallel assignment (sometimes called multiple assignment).
n,h = b.next
#=> [1, {}]
n #=> 1
h #=> {}
c = arr.each_cons(n)
#=> #<Enumerator: ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"]:each_cons(1)>
c is an enumerator that generates arrays of n consecutive characters of str; here n is 1, so each array holds a single character. At the next iteration the arrays will hold 2 consecutive characters, and so on. See Enumerable#each_cons.
c.to_a # Let's see which elements will be generated.
#=> [["a"], ["b"], ["j"], ["k"], ["a"], ["b"], ["r"], ["j"], ["k"], ["a"], ["b"]]
d = c.map { |b| b.join }
#=> ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"]
e = d.uniq
#=> ["a", "b", "j", "k", "r"]
At the next iteration this will be
r = arr.each_cons(2)
#=> #<Enumerator: ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"]:
# each_cons(2)>
r.to_a
#=> [["a", "b"], ["b", "j"], ["j", "k"], ["k", "a"], ["a", "b"],
# ["b", "r"], ["r", "j"], ["j", "k"], ["k", "a"], ["a", "b"]]
s = r.map { |b| b.join }
#=> ["ab", "bj", "jk", "ka", "ab", "br", "rj", "jk", "ka", "ab"]
s.uniq
#=> ["ab", "bj", "jk", "ka", "br", "rj"]
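The two iterations just shown follow the same pattern, which can be summarised in a small helper. This is only a sketch: unique_substrings is a hypothetical name that does not appear in the method above, and map(&:join) is shorthand for map { |b| b.join }.

```ruby
# Hypothetical helper (not part of the method above): the distinct substrings
# of length n, in order of first appearance.
def unique_substrings(str, n)
  str.chars.each_cons(n).map(&:join).uniq
end

len1 = unique_substrings('abjkabrjkab', 1)
#=> ["a", "b", "j", "k", "r"]
len2 = unique_substrings('abjkabrjkab', 2)
#=> ["ab", "bj", "jk", "ka", "br", "rj"]
```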
Continuing,
f = e.each
#=> #<Enumerator: ["a", "b", "j", "k", "r"]:each>
f.to_a # Let's see which elements will be generated.
#=> ["a", "b", "j", "k", "r"]
s = f.next
#=> "a"
r = Regexp.new(Regexp.escape(s)) # escape so that s is matched literally
#=> /a/
str.scan(r) { (h[s] ||= []) << Regexp.last_match.begin(0) }
If h does not yet have a key s, h[s] #=> nil. h[s] ||= [], which is effectively h[s] = h[s] || [], sets h[s] to an empty array before h[s] << Regexp.last_match.begin(0) is executed. That is, h[s] = h[s] || [] #=> nil || [] #=> [].
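The ||= idiom can be tried on its own, as in the sketch below. The variable g and the use of Hash.new with a default block are my own additions, shown only as a common alternative; the method above does not use them.

```ruby
h = {}
s = 'a'
first_lookup = h[s]   #=> nil, since h has no key "a" yet
(h[s] ||= []) << 0    # creates the array, then appends 0
(h[s] ||= []) << 4    # the array already exists, so this only appends 4
h                     #=> {"a"=>[0, 4]}

# Alternative (an assumption, not the method's approach): give the hash a
# default block that creates the array on first access.
g = Hash.new { |hash, key| hash[key] = [] }
g['a'] << 0
g['a'] << 4
g                     #=> {"a"=>[0, 4]}
```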
Within the block the MatchData object is retrieved with the class method Regexp::last_match. (Alternatively, one could substitute the global variable $~ for Regexp.last_match. For details, search for "special global variables" at Regexp.) MatchData#begin returns the index of str at which the current match begins.
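The following sketch collects match offsets both ways, with Regexp.last_match and with $~. It also shows, on a made-up string 'a.b.c', why a substring containing a regex metacharacter should be passed through Regexp.escape before being turned into a pattern. All variable names are illustrative.

```ruby
str = 'abjkabrjkab'

via_last_match = []
str.scan(/ab/) { via_last_match << Regexp.last_match.begin(0) }
via_last_match #=> [0, 4, 9]

via_global = []
str.scan(/ab/) { via_global << $~.begin(0) }
via_global #=> [0, 4, 9]

# An unescaped "." matches every character; an escaped one matches only dots.
dotted = 'a.b.c'
unescaped = []
dotted.scan(Regexp.new('.')) { unescaped << $~.begin(0) }
unescaped #=> [0, 1, 2, 3, 4]

escaped = []
dotted.scan(Regexp.new(Regexp.escape('.'))) { escaped << $~.begin(0) }
escaped #=> [1, 3]
```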
Now
h #=> {"a"=>[0, 4, 9]}
The remaining calculations are similar, adding key-value pairs to h. Finally, reject removes every key whose array holds only one offset, leaving the hash given in the example.