How to find indices of identical sub-sequences in two strings in Ruby?

Question

Here each instance of the class DNA corresponds to a string such as 'GCCCAC'. Arrays of substrings containing k-mers can be constructed from these strings. For this string there are 1-mers, 2-mers, 3-mers, 4-mers, 5-mers and one 6-mer:

6 1-mers: ["G", "C", "C", "C", "A", "C"]
5 2-mers: ["GC", "CC", "CC", "CA", "AC"]
4 3-mers: ["GCC", "CCC", "CCA", "CAC"]
3 4-mers: ["GCCC", "CCCA", "CCAC"]
2 5-mers: ["GCCCA", "CCCAC"]
1 6-mers: ["GCCCAC"]

The pattern should be evident. See the Wiki for details.

The problem is to write the method shared_kmers(k, dna2) of the DNA class which returns an array of all pairs [i, j] where this DNA object (that receives the message) shares with dna2 a common k-mer at position i in this dna and at position j in dna2.

dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')

dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]

dna2.shared_kmers(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]

dna1.shared_kmers(3, dna2)
#=> [[2, 0], [3, 1]]

dna1.shared_kmers(4, dna2)
#=> [[2, 0]]

dna1.shared_kmers(5, dna2)
#=> []

What does the first argument to `shared_kmers` mean? The number you're passing from 2-5. — max pleaner, Nov 03 '19 at 22:05
"The problem is to write the method shared_kmers(k, dna2)..." which sounds very much like homework. "[How do I ask and answer homework questions?](https://meta.stackoverflow.com/q/334822/128421)" will help you understand the SO opinion of homework problems. Remember that SO is an extremely visible site, visited by teachers and students alike, so your question could very well be visible to someone who expects you to answer it on your own. — the Tin Man, Nov 04 '19 at 05:23

demir · Accepted Answer · 2019-11-04T21:17:19.717

3

class DNA
  attr_accessor :sequencing

  def initialize(sequencing)
    @sequencing = sequencing
  end

  def kmers(k)
    @sequencing.each_char.each_cons(k).map(&:join)
  end

  def shared_kmers(k, dna)
    kmers(k).each_with_object([]).with_index do |(kmer, result), index|
      dna.kmers(k).each_with_index do |other_kmer, other_kmer_index|
        result << [index, other_kmer_index] if kmer.eql?(other_kmer)
      end
    end
  end
end

dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')

dna1.kmers(2)
#=> ["GC", "CC", "CC", "CA", "AC"]

dna2.kmers(2)
#=> ["CC", "CA", "AC", "CG", "GC"]

dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]

dna2.shared_kmers(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]

dna1.shared_kmers(3, dna2)
#=> [[2, 0], [3, 1]]

dna1.shared_kmers(4, dna2)
#=> [[2, 0]]

dna1.shared_kmers(5, dna2)
#=> []

edited Nov 04 '19 at 21:17

answered Nov 03 '19 at 22:51

demir

4,591
2
22
30

1

Good answer. I tiny suggestion, so tiny that it may not even deserve mention: by replacing `chars` with `each_char` in `@sequencing.chars.each_cons(k)` you can avoid the creation of a temporary array. The rule: use `each_char` when chained to an `Enumerator` or `Enumerable` method; use `chars` when it is not chained or is chained to an `Array` method. – Cary Swoveland Nov 04 '19 at 21:01
I got it. Thanks for suggestion @CarySwoveland – demir Nov 04 '19 at 21:17

Cary Swoveland · Answer 2 · 2019-11-04T20:47:33.137

I will address the crux of your problem only, without reference to a class DNA. It should be easy to reorganize what follows quite easily.

Code

def match_kmers(s1, s2, k)
  h1 = dna_to_index(s1, k)
  h2 = dna_to_index(s2, k)
  h1.flat_map { |k,_| h1[k].product(h2[k] || []) }
end

def dna_to_index(dna, k)
  dna.each_char.
      with_index.
      each_cons(k).
      with_object({}) {|arr,h| (h[arr.map(&:first).join] ||= []) << arr.first.last}
end

Examples

dna1 = 'GCCCAC'
dna2 = 'CCACGC'

match_kmers(dna1, dna2, 2)
  #=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]] 
match_kmers(dna2, dna1, 2)
  #=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]

match_kmers(dna1, dna2, 3)
  #=> [[2, 0], [3, 1]] 
match_kmers(dna2, dna1, 3)
  #=> [[0, 2], [1, 3]]

match_kmers(dna1, dna2, 4)
  #=> [[2, 0]] 
match_kmers(dna2, dna1, 4)
  #=> [[0, 2]]

match_kmers(dna1, dna2, 5)
  #=> [] 
match_kmers(dna2, dna1, 5)
  #=> []

match_kmers(dna1, dna2, 6)
  #=> [] 
match_kmers(dna2, dna1, 6)
  #=> []

Explanation

Consider dna1 = 'GCCCAC'. This contains 5 2-mers (k = 2):

dna1.each_char.each_cons(2).to_a.map(&:join)
  #=> ["GC", "CC", "CC", "CA", "AC"]

Similarly, for dna2 = 'CCACGC':

dna2.each_char.each_cons(2).to_a.map(&:join)
  #=> ["CC", "CA", "AC", "CG", "GC"]

These are the keys of the hashes produced by dna_to_index for dna1 and dna2, respectively. The hash values are arrays of indices of where the corresponding key begins in the DNA string. Let's compute those hashes for k = 2:

h1 = dna_to_index(dna1, 2)
  #=> {"GC"=>[0], "CC"=>[1, 2], "CA"=>[3], "AC"=>[4]} 
h2 = dna_to_index(dna2, 2)
  #=> {"CC"=>[0], "CA"=>[1], "AC"=>[2], "CG"=>[3], "GC"=>[4]}

h1 shows that:

"GC" begins at index 0 of dna1
"CC" begins at indices 1 and 2 of dna1
"CA" begins at index 3 of dna1
"CC" begins at index 4 of dna1

h2 has a similar interpretation. See Enumerable#flat_map and Array#product.

The method match_kmers is then used to construct the desired array of pairs of indices [i, j] such that h1[i] = h2[j].

Now let's look at the hashes produced for 3-mers (k = 3):

h1 = dna_to_index(dna1, 3)
  #=> {"GCC"=>[0], "CCC"=>[1], "CCA"=>[2], "CAC"=>[3]} 
h2 = dna_to_index(dna2, 3)
  #=> {"CCA"=>[0], "CAC"=>[1], "ACG"=>[2], "CGC"=>[3]}

We see that the first 3-mer in dna1 is "GCC", beginning at index 0. This 3-mer does not appear in dna2, however, so there are no elements [0, X] in the array returned (X being just a placeholder). Nor is "CCC" a key in the second hash. "CCA" and "CAC" are present in the second hash, however, so the array returned is:

h1["CCA"].product(h2["CCA"]) + h1["CAC"].product(h2["CAC"]) 
  #=> [[2, 0]] + [[3, 1]]
  #=> [[2, 0], [3, 1]]

Stefan · Answer 3 · 2019-11-04T09:21:21.927

I would start by writing a method to enumerate subsequences of a given length (i.e. the k-mers):

class DNA
  def initialize(sequence)
    @sequence = sequence
  end

  def each_kmer(length)
    return enum_for(:each_kmer, length) unless block_given?

    0.upto(@sequence.length - length) { |i| yield @sequence[i, length] }
  end
end

DNA.new('GCCCAC').each_kmer(2).to_a
#=> ["GC", "CC", "CC", "CA", "AC"]

On top of this you can easily collect the indices of identical k-mers using a nested loop:

class DNA
  # ...

  def shared_kmers(length, other)
    indices = []
    each_kmer(length).with_index do |k, i|
      other.each_kmer(length).with_index do |l, j|
        indices << [i, j] if k == l
      end
    end
    indices
  end
end

dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')

dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]

Unfortunately, the above code traverses other.each_kmer for each k-mer in the receiver. We can optimize this by building a hash containing all indices for each k-mer in other up-front:

class DNA
  # ...

  def shared_kmers(length, other)
    hash = Hash.new { |h, k| h[k] = [] }
    other.each_kmer(length).with_index { |k, i| hash[k] << i }

    indices = []
    each_kmer(length).with_index do |k, i|
      hash[k].each { |j| indices << [i, j] }
    end
    indices
  end
end

How to find indices of identical sub-sequences in two strings in Ruby?

3 Answers3