Genomic range query ruby implementation

Question

I took a demo test and scored 62. I guess my code is not efficient enough to reach the top score of 100. So how can I efficiently find the lowest character code in a substring? For example, given the string s="ACGTTAGTAC", find the minimal character in the substring s[p..q] efficiently, bearing in mind that there are many repeated queries with the same s but different [p,q]. The problem is actually called Range Minimum Query (RMQ), and more than one algorithm can solve it, but I have difficulty understanding them and applying them to this specific instance. Can anyone advise how to fix the code?

# s is a string, p and q are arrays of integers with p[i] <= q[i]
def solution (s, p, q)
  len = s.length
  a = Array.new(len,0)
  for k in 0..len-1
    case s[k]
    when 'A'
      a[k] = 1
    when 'C'
      a[k] = 2
    when 'G'
      a[k] = 3
    when 'T'
      a[k] = 4
    end
  end
  s = []
  m = p.size
  for i in 0..m-1
    s << a[p[i]..q[i]].min
  end
  s
end

Due to copyright concerns, the full question is not copied here. You can read the full details at this link: https://codility.com/demo/results/demoHSB3XQ-R24/.

The Codility analysis shows you are not meeting expected time targets, and even explains which test inputs cause the problems. As it stands, this question is "please debug my code on this test to get me a higher score". Suggestion: take a failing input to this script, try and improve performance, and ask about that - e.g. "How to efficiently find lowest character code in a substring?". The answer is very likely to involve knowing where all the 1's, 2's, 3's, 4's are *before* running the query, whereas your solution is correct but brute-force and does not scale well. — Neil Slater, Nov 21 '13 at 20:53
I re-edited your question to emphasize the scaling issue you have. You have two dimensions to the input, N (length of string) and M (number of queries). Your solution is reasonably efficient for large N - it is `O(N)` when M is fixed, which is about as good as it can get. However, your `O(N)` costs are split into two parts, and you repeat one of them `M` times, giving you `O(N * M)`, when it should be possible to achieve `O(N + M)` by building an efficient query structure. Basically, although it slows down single queries, you need to think about building an *index* of `s` — Neil Slater, Nov 22 '13 at 10:17
Thanks for editing. I have no idea how to build an index for a string. — canoe, Nov 22 '13 at 13:53

Neil Slater · Answer 1 · 2013-11-22T15:28:36.363

Your solution scales poorly because you repeatedly scan the input for each query, generating sub-arrays and looking for minimum values directly. This is inefficient when you have a lot of queries to process. For example, if any sub-string contains "A", then a string which contains that sub-string also contains "A", but your solution throws away that prior knowledge and re-calculates. The end result is that your solution not only scales with the size of the input string, but multiplies that by the number of queries. When s is long and there are also many [p,q] queries, this leads to poor performance.

You can improve the scaling of your code by pre-processing s into an indexed structure that is designed to answer the query most efficiently. Discovering the right structure to use is a significant part of the challenge in the coding question. Getting purely "correct output" code is only half way there, so the score metric of 62/100 seems valid.

Here is an index structure that can efficiently find the minimum character in a given index range from a fixed string.

Start by analysing the string into a two-part index:

s = "AGTCTTCGATGAAGCACATG"
len = s.length

# Index to answer "what count of each character type comes next in s"
# E.g.  next_char_instance["A"][7]  returns the instance number of "A" that is
# at or after position 7  ( == 1 )
next_char_instance = Hash[ "A" => Array.new(len), "C" => Array.new(len), 
  "G" => Array.new(len), "T" => Array.new(len) ]

# Index to answer "where does count value n of this character appear in s"
# E.g.  pos_of_char_instance["A"][1]  returns the index position of 
# the second "A" ( == 8 )
pos_of_char_instance = Hash[ "A" => Array.new, "C" => Array.new, 
  "G" => Array.new, "T" => Array.new ]

# Basic building block during iteration
next_instance_ids = Hash[ "A" => 0, "C" => 0, "G" => 0, "T" => 0 ]

# Build the two indexes - O( N )
(0...len).each do |i|
  next_instance_ids.each do |letter, next_instance_id|
    next_char_instance[letter][i] = next_instance_id
  end
  this_letter = s[i]
  pos_of_char_instance[this_letter] << i
  next_instance_ids[this_letter] += 1
end

So that's O( N ), because you iterate the string once and all the other work per position is (effectively) constant. OK, creating the arrays is also O( N ), but probably 10 times faster, and if you find yourself thinking O( 1.4 * N ), don't panic: you throw away the constant 1.4 when considering purely scaling issues.

Now that you have this index, you can ask in turn "Where is the next A (or C or G) at or after this position?" really efficiently, and you can use that to quickly find the minimal character inside a particular range. In fact, as each query needs only fixed-cost lookups and a few comparisons, it will be O( 1 ) for each query, and therefore O( M ) overall:

# Example queries
p = [ 0, 3, 2, 7 ]    
q = [ 6, 4, 2, 9 ]    

# Search for each query - O( M )
p.zip(q).map do | a, b |
  # We need to find lowest character possible that fits into the range
  "ACGT".chars.find do | letter |
     next_instance_id = next_char_instance[ letter ][ a ]
     pos_next_instance = pos_of_char_instance[ letter ][ next_instance_id ]
     true if pos_next_instance && pos_next_instance <= b
  end
end
# => ["A", "C", "T", "A"] is output for example data

I've left this mapped to the letters; hopefully you can see that outputting 1,2,3,4 is a trivial addition to this. In fact the numbering, and the use of genome-style letters, are red herrings in the puzzle. They have no real impact on the solution (other than that generating the structure for just 4 letters is easier to code as fixed values).
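
As a rough sketch of how the pieces above might be combined into the `solution(s, p, q)` signature from the question, returning 1,2,3,4 instead of letters. The local `impact` hash is an illustrative name, not something from the original puzzle text, and the sketch assumes s contains only the letters A, C, G and T:

```ruby
# Sketch only: wraps the index build and the query loop in one method.
# "impact" is a hypothetical name for the letter-to-number mapping.
def solution(s, p, q)
  impact = { "A" => 1, "C" => 2, "G" => 3, "T" => 4 }
  letters = impact.keys
  len = s.length

  next_char_instance = Hash[letters.map { |letter| [letter, Array.new(len)] }]
  pos_of_char_instance = Hash[letters.map { |letter| [letter, []] }]
  next_instance_ids = Hash[letters.map { |letter| [letter, 0] }]

  # Build the two indexes - O( N )
  (0...len).each do |i|
    next_instance_ids.each do |letter, next_instance_id|
      next_char_instance[letter][i] = next_instance_id
    end
    this_letter = s[i]
    pos_of_char_instance[this_letter] << i
    next_instance_ids[this_letter] += 1
  end

  # Search for each query - O( M )
  p.zip(q).map do |a, b|
    lowest = letters.find do |letter|
      pos = pos_of_char_instance[letter][next_char_instance[letter][a]]
      pos && pos <= b
    end
    impact[lowest]
  end
end
```

Calling `solution("AGTCTTCGATGAAGCACATG", [0, 3, 2, 7], [6, 4, 2, 9])` should give the numeric equivalent of the letter output shown earlier.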

The above solution is one of possibly many. You should note that it relies on the number of allowed letters being fixed, and would not work as an answer to finding a minimum value in a range where the individual entries were integers or floats.
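
For the general case with arbitrary integers or floats, one commonly used RMQ structure is a sparse table. This is a different technique from the per-letter index above, and the sketch below is only a minimal illustration of it: the method names `build_sparse_table` and `range_min` are hypothetical, and the build step is O( N log N ) rather than O( N ), with O( 1 ) per query afterwards.

```ruby
# Sketch of a sparse table (not the per-letter index from this answer).
# table[k][i] holds the minimum of values[i, 2**k].
def build_sparse_table(values)
  table = [values.dup]
  width = 1
  while 2 * width <= values.length
    prev = table.last
    # Each new entry combines two adjacent blocks of the previous width
    row = (0..values.length - 2 * width).map do |i|
      [prev[i], prev[i + width]].min
    end
    table << row
    width *= 2
  end
  table
end

# Minimum of values[a..b], inclusive, assuming 0 <= a <= b < values.length.
# Two (possibly overlapping) power-of-two blocks cover the whole range.
def range_min(table, a, b)
  level = (b - a + 1).bit_length - 1 # floor(log2(range length))
  width = 1 << level
  [table[level][a], table[level][b - width + 1]].min
end

table = build_sparse_table([3, 1.5, 4, 1, 5, 9, 2, 6])
range_min(table, 0, 2) # minimum of 3, 1.5, 4
```

Overlapping blocks are harmless here because taking a minimum twice over the same element does not change the result, which is what makes the constant-time query possible.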
