
I have a huge list (N ≈ 1 million) of strings, each 100 characters long, that I'm trying to find the overlaps between. For instance, one string might be

XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)

I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)

read in all strings to array

create empty NxN matrix

compare each string to every string with a higher array index (to avoid redoing comparisons)

Write longest overlap to matrix

There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.

In case you didn't guess, these are DNA fragments. X indicates a "wild card" (the probe gave a quality score below threshold) and every other character is a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but that method was far too memory intensive.

I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.

Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed.

//part that compares them all to each other
for (int j = 0; j < counter; j++)      //counter holds # of DNA strings
    for (int k = j + 1; k < counter; k++)
        int test = determineBestOverlap(DNArray[j], DNArray[k]);

//boring stuff

//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(const string &str1, const string &str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (int j = 0; j < (int)str2.length(); j++)
    {
        int counter = 0, offset = 0;
        //stop at either string's end, on a mismatch, or on an 'X'
        while (offset < (int)str1.length() && j + offset < (int)str2.length()
               && str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }

        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;   //kept for the full version; unused here
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings
  • Are those overlaps allowed to be shifted, i.e. character 10-80 from string A match 20-90 from string B? And does an X match everything in the corresponding string, or only another X? – Guntram Blohm Mar 02 '14 at 21:19
  • Yes. Any overlap anywhere within the two fragments (going either direction) must be considered. – Dustin Mar 02 '14 at 21:20
  • What algorithm are you using for each comparison? You're comparing each string to a lot of others, so it would be worth doing some precomputation (Boyer Moore etc) on each string before doing all the comparisons against it. – Alan Stokes Mar 02 '14 at 21:30
  • Was updating while you commented, hopefully the code answers your questions. – Dustin Mar 02 '14 at 21:37
  • @AlanStokes so the answer to your question is "pretty much the worst one possible" – Dustin Mar 02 '14 at 21:44
  • It seems like you’re trying to solve a problem that’s already solved (to some extent), namely [sequence assembly problem](http://en.wikipedia.org/wiki/Sequence_assembly), and more specifically, *read mapping*. The problem is far from trivial, and there exist very efficient solutions for it. What exactly are you trying to achieve that they don’t? tl;dr: You cannot compare all with all strings, the costs of that are prohibitive. This is a dead end. – Konrad Rudolph Mar 02 '14 at 22:20
  • I want to write my own, mostly because the current programs available are very poorly optimized for multiple cores. I already have an extremely efficient and accurate algorithm to assemble the genome from an overlap matrix, and if you read Guntram's solution it seems doable. I just need to be able to build my own matrices in a format that is compatible with my existing algorithm. – Dustin Mar 02 '14 at 22:27
  • @Dustin You must be kidding. BWA for instance scales to 1000 cores. Literally. They’ve run it on one of those huge IBM supercomputers. And no, your idea is not “doable”. Creating the matrix will take longer than it will take to run any of the existing programs a hundred times. There’s a wealth of research on this particular problem out there. Read the research before programming off into the sunset. – Konrad Rudolph Mar 02 '14 at 22:40

2 Answers


Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means roughly N²/2 = 5·10¹¹ comparisons, and even storing just one byte per string pair takes half a terabyte of memory.

However, I assume what you are really interested in is long matches, those with more than, say, 20 or 30 or even more than 80 characters in common, and you probably don't really want to know if two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.

What I'd try if I were you - still without knowing if that fits your application - is the following (a rough code sketch follows the discussion below):

1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore the 'X's at the start and end entirely, and if some "readable" parts are broken up by long runs of 'X's, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application, which I don't really know.

2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length, longest first. You may, but don't really have to, store the indexes of the original strings together with each substring. You'll get something like (example)

AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC    1
......

3) Now, from the top to the bottom of the list:

a) Set the "current string" to the string topmost on the list.

b) If the occurrence count next to the current string is > 1, you found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.

c) Compare the current string with all strings of the same length, to find matches where some characters are X.

d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurrence counter by one, else enter it into the table.

e) Repeat 3d) with the last, instead of the first, character removed from the current string.

f) Remove the current string from the list.

g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.

Whether this is a better algorithm depends very much on your data and on which comparisons you're really interested in. If your data is very random/you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
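For concreteness, here is a minimal C++ sketch of steps 2) and 3), under a few assumptions: step 1's substring extraction is already done, the X-tolerant comparison of step 3c is omitted, and the sample reads and the `minInteresting` cutoff are made-up placeholders.

#include <iostream>
#include <map>
#include <string>
#include <vector>

//order the table longest-first, so the "topmost" string of step 3a
//is simply table.begin()
struct LongerFirst {
    bool operator()(const std::string &a, const std::string &b) const {
        if (a.length() != b.length()) return a.length() > b.length();
        return a < b;
    }
};

int main() {
    //step 1 assumed done: these are the X-trimmed readable substrings
    std::vector<std::string> reads = {"AGCGCTATCG", "GCGCTATC", "TTGACCTG"};

    //step 2: substring -> occurrence count, ordered longest first
    std::map<std::string, int, LongerFirst> table;
    for (const std::string &r : reads)
        table[r]++;

    const std::size_t minInteresting = 5;      //3g: stop below this length
    while (!table.empty()) {
        auto top = table.begin();              //3a: current = longest string
        std::string cur = top->first;
        if (cur.length() < minInteresting)
            break;                             //3g: the rest is too short
        if (top->second > 1)                   //3b: candidate match
            std::cout << "shared substring: " << cur << '\n';
        table.erase(top);                      //3f: drop the current string
        table[cur.substr(1)]++;                //3d: first character removed
        table[cur.substr(0, cur.length() - 1)]++;  //3e: last character removed
    }
}

Note that on a hit in 3b you would still go back to the original strings to see which reads actually share the substring, since two truncations of the same read can also collide in the table.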

Guntram Blohm
  • hmmm, very interesting... I like this idea. I'll have to code it to see if it works with my data. My application is stringing together the entire genome from the individual pieces. I'm using a nearest-neighbors solution to the traveling salesman problem to traverse the most optimal set of overlaps, which is why I need the set of overlaps. – Dustin Mar 02 '14 at 22:05

I don't see many ways around the fact that you need to compare each string with every other one, including shifted alignments, and that is by itself extremely slow; a computing cluster seems the best approach.

The only thing I see how to improve is the string comparison by itself: replace A,C,T,G and X by binary patterns:

  • A = 0x01
  • C = 0x02
  • T = 0x04
  • G = 0x08
  • X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (that packing might not be a good idea, but it is still a possible option to investigate), and then compare positions quickly with an AND operation, so that you 'just' have to count how many consecutive non-zero values you have. That's just a way to handle the wildcard; sorry, I don't have a better idea for reducing the complexity of the overall comparison. A rough sketch of the encoding follows.
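Something like this, as a minimal sketch of the encoding and the AND-based compare (one encoded item per byte for clarity rather than the packed two-per-byte variant; `encodeBase` and `longestCompatibleRun` are made-up names, and note this treats X as matching everything, which is what the bit patterns above imply):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

//map each base to a one-hot bit pattern; X gets all four bits set,
//so it is compatible with every base
std::uint8_t encodeBase(char c) {
    switch (c) {
        case 'A': return 0x01;
        case 'C': return 0x02;
        case 'T': return 0x04;
        case 'G': return 0x08;
        default:  return 0x0F;   //'X' (wildcard)
    }
}

std::vector<std::uint8_t> encode(const std::string &s) {
    std::vector<std::uint8_t> out(s.size());
    for (std::size_t i = 0; i < s.size(); i++)
        out[i] = encodeBase(s[i]);
    return out;
}

//longest run of compatible positions at one fixed alignment:
//(a[i] & b[i]) != 0 means "same base, or at least one side is a wildcard"
int longestCompatibleRun(const std::vector<std::uint8_t> &a,
                         const std::vector<std::uint8_t> &b) {
    int best = 0, run = 0;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; i++) {
        run = (a[i] & b[i]) ? run + 1 : 0;
        best = std::max(best, run);
    }
    return best;
}

int main() {
    std::vector<std::uint8_t> a = encode("AACTGCX");
    std::vector<std::uint8_t> b = encode("AACTGGX");
    std::cout << longestCompatibleRun(a, b) << '\n';   //prints 5 (AACTG)
}

You would still loop over every shift of one string against the other, as in the question's code; the AND just replaces the per-position character test.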
Nicolas Defranoux
  • Three bits are enough, actually: you only have five different values. Even better, two bits are enough if you don’t need the wildcard, and many (most?) algorithms which solve this problem in practice do away with the wildcard for this reason. – Konrad Rudolph Mar 02 '14 at 22:23