
I have n sequences, each about 3 billion characters long (human genomes). I am looking for an efficient way to store or represent these n strings. One natural representation I can think of is a graph in which nodes store substrings common to several sequences, directed edges connect nodes where the sequences vary, and a set of paths P = P1 ... Pq is kept, where each path spells out one original sequence.

For example:

Suppose we have four strings: S1 = ATCGGCT, S2 = ATCGATT, S3 = GTCGGCT, S4 = GTCGATT. Then the graph should look as follows:

[image: graph for the four example strings]
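A minimal sketch of how such a graph could be built for the toy example, under a strong assumption that is not part of the question: all strings have the same length and are already aligned, so they differ by substitutions only. Adjacent columns that split the strings into the same groups are merged into one block, and each distinct substring in a block becomes a node. The name `build_graph` and the returned `(nodes, edges, paths)` layout are made up for illustration; real genomes contain insertions and deletions and would need a multiple alignment first.

```python
def build_graph(strings):
    """Toy variation graph for equal-length, pre-aligned strings (assumption)."""
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("this sketch assumes equal-length, aligned strings")

    def partition(col):
        # Group string indices by the base they carry in this column.
        groups = {}
        for idx, s in enumerate(strings):
            groups.setdefault(s[col], []).append(idx)
        return tuple(sorted(tuple(g) for g in groups.values()))

    # Merge adjacent columns that split the strings in the same way.
    blocks = []  # half-open column ranges (start, end)
    start = 0
    for col in range(1, length + 1):
        if col == length or partition(col) != partition(start):
            blocks.append((start, col))
            start = col

    nodes = []                        # node id -> substring label
    node_id = {}                      # (block index, label) -> node id
    edges = set()                     # directed edges (from id, to id)
    paths = [[] for _ in strings]     # one node-id path per input string
    for b, (lo, hi) in enumerate(blocks):
        for p, s in enumerate(strings):
            key = (b, s[lo:hi])
            if key not in node_id:
                node_id[key] = len(nodes)
                nodes.append(s[lo:hi])
            nid = node_id[key]
            if paths[p]:
                edges.add((paths[p][-1], nid))
            paths[p].append(nid)
    return nodes, sorted(edges), paths


strings = ["ATCGGCT", "ATCGATT", "GTCGGCT", "GTCGATT"]
nodes, edges, paths = build_graph(strings)
print(nodes)   # node labels
print(edges)   # directed edges between node ids
print(paths)   # path of node ids for S1..S4
```

Concatenating the node labels along each path gives back the original string, so the paths are what make the representation lossless.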

The problem I am facing is how to find the longest common subsequence shared by all n sequences and, if there is none, by n-1 sequences, and so on. Can anybody point me to a resource with guidance or pseudocode for this? Thanks in advance.
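To make the "all n, then n-1, and so on" search concrete, here is a small sketch. It assumes that contiguous substrings are meant (as in the graph nodes), not subsequences with gaps, and it uses a brute-force set of substrings per string plus a binary search on the length. The function names and the tie-breaking rule (smallest substring in lexicographic order) are my own choices. This only works for small inputs; at genome scale a generalized suffix tree or suffix array is the usual tool for the "longest substring common to at least k of n strings" problem.

```python
from collections import Counter


def common_substring_of_length(strings, length, k):
    """Return a substring of the given length found in at least k strings, or None."""
    counts = Counter()
    for s in strings:
        # A set, so each string counts a substring at most once.
        seen = {s[i:i + length] for i in range(len(s) - length + 1)}
        counts.update(seen)
    hits = [sub for sub, c in counts.items() if c >= k]
    return min(hits) if hits else None  # arbitrary but deterministic tie-break


def longest_substring_in_k(strings, k):
    """Longest substring shared by at least k of the strings ("" if none)."""
    best = ""
    lo, hi = 1, max(len(s) for s in strings)
    # Binary search is valid because any prefix of a shared substring
    # is also shared by the same strings.
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = common_substring_of_length(strings, mid, k)
        if hit is not None:
            best, lo = hit, mid + 1
        else:
            hi = mid - 1
    return best


strings = ["ATCGGCT", "ATCGATT", "GTCGGCT", "GTCGATT"]
for k in range(len(strings), 1, -1):
    print(k, longest_substring_in_k(strings, k))
```

For the four example strings this finds `TCG` for k = 4 and k = 3, and a 6-character substring once only two strings have to agree.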

Mahendra Gunawardena
Raghu
  • Given that each of those pointers will probably be a 32-bit integer, and you'll need to keep track of which path each string takes somehow, I don't see how this approach will ever reach the efficiency of storing each string separately, encoding each base as 2 bits. – beaker May 23 '17 at 20:32
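For reference, the 2-bit baseline mentioned in the comment can be sketched as follows. This assumes the sequence contains only A, C, G and T (real assemblies also contain N and other symbols, which this ignores); `pack`, `unpack` and `packed_size_bytes` are illustrative names, not from any library.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a string over A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, n):
    """Recover the first n bases from packed bytes."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))


def packed_size_bytes(n_bases):
    """Bytes needed to store n_bases at 2 bits per base."""
    return (n_bases + 3) // 4


seq = "ATCGGCT"
packed = pack(seq)
print(len(packed), unpack(packed, len(seq)))
print(packed_size_bytes(3_000_000_000))  # bytes for one 3-billion-base sequence
```

At 2 bits per base, a 3-billion-base sequence takes 750,000,000 bytes, so any graph representation has to come in below roughly 750 MB per genome, including its edges and paths, to be worth it on size alone.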

0 Answers