I have n sequences, each about 3 billion characters long (human genomes). I am looking for an efficient way to store and represent these n strings. One natural way I can think of is a graph in which nodes store substrings common to the sequences, directed edges connect nodes where the sequences vary, and a set of paths P = P1, ..., Pq is kept, where each path represents one original sequence.
For example:
Suppose we have four strings: S1 = ATCGGCT, S2 = ATCGATT, S3 = GTCGGCT, S4 = GTCGATT. Then the graph should be as follows.
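To make the structure concrete, here is a minimal sketch of the kind of graph I mean. It assumes the sequences are already aligned and of equal length, which is true for the toy example but not for real genomes. The function name `build_variation_graph` and the column-by-column grouping are only illustrative; other node boundaries would be equally valid.

```python
from itertools import groupby


def build_variation_graph(seqs):
    """Build a toy variation graph from equal-length, pre-aligned sequences.

    Returns (labels, edges, paths): labels[i] is the substring stored in
    node i, edges is a set of directed (from, to) node pairs, and paths[j]
    is the list of node ids that spells out seqs[j].
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("this sketch assumes equal-length, pre-aligned sequences")
    length = len(seqs[0])

    # A column is conserved when every sequence has the same character there.
    conserved = [len({s[i] for s in seqs}) == 1 for i in range(length)]

    # Group maximal runs of consecutive columns of the same kind.
    runs, pos = [], 0
    for _, group in groupby(conserved):
        size = len(list(group))
        runs.append((pos, pos + size))
        pos += size

    nodes = {}    # (run index, substring) -> node id
    labels = []   # node id -> substring
    edges = set()
    paths = []
    for s in seqs:
        path = []
        for r, (start, end) in enumerate(runs):
            key = (r, s[start:end])
            if key not in nodes:
                nodes[key] = len(labels)
                labels.append(s[start:end])
            path.append(nodes[key])
        # Consecutive nodes on a path are joined by a directed edge.
        edges.update(zip(path, path[1:]))
        paths.append(path)
    return labels, edges, paths


labels, edges, paths = build_variation_graph(
    ["ATCGGCT", "ATCGATT", "GTCGGCT", "GTCGATT"]
)
```

Each original string can be recovered by concatenating the node labels along its path, so the graph plus the paths is a lossless representation of the input.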
The problem I am facing is how to find the longest subsequence that is common to all n sequences, and, failing that, to n-1 sequences, and so on. Can anybody point me to a resource with guidance or pseudocode for this? Thanks in advance.
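For reference, this is the naive baseline I can write for small inputs. It assumes "common subsequence" means a contiguous substring, as in the node definition above, and the function names are my own. It binary-searches the length and counts how many sequences contain each substring of that length, so it is far too memory-hungry for 3-billion-character inputs and only pins down what I am trying to compute.

```python
from collections import Counter


def common_substrings_of_length(seqs, length, k):
    """Substrings of the given length that occur in at least k sequences."""
    counts = Counter()
    for s in seqs:
        # Use a set so each sequence votes at most once per substring.
        counts.update({s[i:i + length] for i in range(len(s) - length + 1)})
    return {sub for sub, c in counts.items() if c >= k}


def longest_common_substrings(seqs, k):
    """Longest substrings shared by at least k of the sequences (sorted)."""
    lo, hi = 0, max(len(s) for s in seqs)
    best = set()
    # If a substring of length L is shared by k sequences, so is its prefix
    # of length L - 1, which makes the answer monotone in L and lets us
    # binary-search the length.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        found = common_substrings_of_length(seqs, mid, k)
        if found:
            lo, best = mid, found
        else:
            hi = mid - 1
    return sorted(best)


seqs = ["ATCGGCT", "ATCGATT", "GTCGGCT", "GTCGATT"]
# Try all n sequences first, then n-1, and so on.
for k in range(len(seqs), 1, -1):
    print(k, longest_common_substrings(seqs, k))
```

What I am missing is an approach that scales this idea to whole genomes.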