Graph Representation of N Genomes

Asked May 23 '17 at 10:17

Active May 23 '17 at 10:32

I have n sequences, each about 3 billion bases long (human genomes). I am looking for an efficient way to store/represent these n strings. One natural way I can think of is a graph in which nodes store the substrings common to these sequences, directed edges connect nodes where the sequences vary, and a set of paths P = {P1, ..., Pq} is kept, where each path represents one original sequence.

For example:

Suppose we have four strings S1 = ATCGGCT, S2 = ATCGATT, S3 = GTCGGCT, S4 = GTCGATT. Then the graph should be as follows:
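A minimal sketch of how such a graph could be built for the four example strings. It assumes all strings have the same length and differ only by substitutions, so columns can be compared position by position; real genomes with insertions and deletions would need an alignment step first. The function name `build_variation_graph` and the node numbering are illustrative choices, not taken from any library.

```python
def build_variation_graph(strings):
    """Collapse equal-length strings into shared and variable blocks.

    Returns (labels, edges, paths): labels[i] is the substring stored in
    node i, edges is a sorted list of directed (from, to) node pairs, and
    paths[j] is the list of node ids that spells strings[j].
    """
    length = len(strings[0])
    assert all(len(s) == length for s in strings), "sketch needs equal lengths"

    # A column is "shared" when every string has the same base there.
    shared = [len({s[i] for s in strings}) == 1 for i in range(length)]

    # Cut the columns into maximal runs that are all shared or all variable.
    blocks = []
    start = 0
    for i in range(1, length + 1):
        if i == length or shared[i] != shared[start]:
            blocks.append((start, i))
            start = i

    nodes = {}    # (block index, substring) -> node id
    labels = []   # node id -> substring
    edges = set()
    paths = []
    for s in strings:
        path = []
        for b, (lo, hi) in enumerate(blocks):
            key = (b, s[lo:hi])
            if key not in nodes:
                nodes[key] = len(labels)
                labels.append(s[lo:hi])
            path.append(nodes[key])
        edges.update(zip(path, path[1:]))
        paths.append(path)
    return labels, sorted(edges), paths


strings = ["ATCGGCT", "ATCGATT", "GTCGGCT", "GTCGATT"]
labels, edges, paths = build_variation_graph(strings)
print(labels)  # ['A', 'TCG', 'GC', 'T', 'AT', 'G']
print(paths)   # [[0, 1, 2, 3], [0, 1, 4, 3], [5, 1, 2, 3], [5, 1, 4, 3]]
```

Each original string can be recovered by concatenating the node labels along its path, so the paths play the role of P1, ..., Pq.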

The problem I am facing is how to find the longest common subsequence shared by all n sequences and, if there is none, by n-1 of the sequences, and so on. Can anybody point me towards a resource where I can find direction or pseudocode for this? Thanks in advance.
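A small brute-force sketch of that search, assuming "common subsequence" here means a contiguous substring (as in the graph nodes described earlier). It binary-searches the length, which works because any prefix of a substring found in k strings is also found in those k strings. The function names are hypothetical, and this approach only suits short inputs; at genome scale the usual tools for the same question are a generalized suffix tree or suffix array.

```python
from collections import Counter


def common_substring_of_length(strings, length, k):
    """Return a substring of this length found in at least k strings, or None."""
    if length == 0:
        return ""
    counts = Counter()
    for s in strings:
        # Use a set so each string votes at most once per distinct substring.
        seen = {s[i:i + length] for i in range(len(s) - length + 1)}
        counts.update(seen)
    candidates = [sub for sub, c in counts.items() if c >= k]
    # min() only makes the choice deterministic when several candidates tie.
    return min(candidates) if candidates else None


def longest_substring_in_k(strings, k):
    """Longest substring that occurs in at least k of the given strings."""
    if not 1 <= k <= len(strings):
        raise ValueError("k must be between 1 and the number of strings")
    # No answer can be longer than the k-th longest string.
    lo, hi = 0, sorted((len(s) for s in strings), reverse=True)[k - 1]
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        found = common_substring_of_length(strings, mid, k)
        if found is not None:
            best, lo = found, mid
        else:
            hi = mid - 1
    return best


strings = ["ATCGGCT", "ATCGATT", "GTCGGCT", "GTCGATT"]
for k in range(len(strings), 1, -1):
    print(k, longest_substring_in_k(strings, k))
# 4 TCG
# 3 TCG
# 2 TCGATT
```

Trying k = n first and then lowering it mirrors the "n, then n-1, and so on" order in the question.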

edited May 23 '17 at 10:32

Mahendra Gunawardena

asked May 23 '17 at 10:17

Raghu

Given that each of those pointers will probably be a 32-bit integer, and you'll need to keep track of which path each string takes somehow, I don't see how this approach will ever reach the efficiency of storing each string separately, encoding each base as 2 bits. – beaker May 23 '17 at 20:32
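To make the 2-bit baseline in this comment concrete, here is a minimal sketch that packs four bases into each byte. It assumes the sequence contains only A, C, G and T (no N or other ambiguity codes); the names `pack`, `unpack` and `packed_size_bytes` are illustrative.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte (2 bits each)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length))


def packed_size_bytes(num_bases):
    """Bytes needed for num_bases at 2 bits per base."""
    return (num_bases + 3) // 4


packed = pack("ATCGGCT")
print(len(packed))                         # 2
print(unpack(packed, 7))                   # ATCGGCT
print(packed_size_bytes(3_000_000_000))    # 750000000
```

At this rate one 3-billion-base sequence takes about 750 MB, which is the figure any graph representation with per-edge pointers and per-string paths has to beat.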

0 Answers