
I am looking for a solution to the following problem:

I have a data frame with over 6 million rows, each of which contains sequencing information (a DNA sequence). Because of the way the dataset was reported, there are duplicated rows in the data frame. BUT: these duplications are not perfect matches. Let me show this with an example.

row 1: ATCTCAGCATCATACCAACTACTA
...
row 5: ATCTCAGCATCATA..........

The previous block shows two sequences from two different rows of the data frame. The dots are shown only for visualization purposes (they are not part of the dataset).

The goal is to mark these sequences as identical. (Ultimately, I want to assign a sequence ID to each row; these two rows should receive the same sequence ID, since the sequence in row 5 is part of the sequence in row 1 and the two sequences are therefore potentially identical.)

I tried base R's `match` function and some attempts with `grep`, but these approaches are all very slow, if they do not fail outright.

I also tried approaches like Biostrings' "matching a dictionary of patterns against a reference" functions, but I already fail at the step of creating the dictionary - apparently because the sequences in the rows have very different lengths.

(Error message from Biostrings:)

Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr,  : 
  element 2 in Trusted Band has a different length than first element

Does anyone have an idea how to achieve this? Again, one challenge is the size of the data frame, with more than 6 million rows, and the need to basically test each row against every other row.

Thanks much for any feedback! This is really appreciated!

ADDITION OF INFORMATION: Would there be a feasible way if the following assumption is true? Only matches at the beginning of the strings are of interest, and at least one string has to match with its complete character sequence. In other words: the complete sequence of one row can be found at the beginning of the character strings of one or more other rows.
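A toy illustration of this assumption (the sequences here are made up for illustration): if every match is anchored at the beginning, then sorting the sequences lexicographically places a sequence directly above any longer sequence that starts with it.

```shell
# Toy example: after lexicographic sorting, a sequence that is a complete
# prefix of another ends up on the line directly above it.
printf '%s\n' \
    ATCTCAGCATCATACCAACTACTA \
    GGCATTAC \
    ATCTCAGCATCATA \
    | sort
# Output:
#   ATCTCAGCATCATA
#   ATCTCAGCATCATACCAACTACTA
#   GGCATTAC
```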

  • So you're asking to find all variable length segments from each row that are duplicated in the first row? Yeah, that's going to be slow, no matter what you do. – iod Apr 28 '20 at 13:42
  • Yes. I basically have to check every row against every row and see if, as in my example, a complete string in one row is part of a (longer) string of another row. Again, this has to be performed on each row versus each row. The whole dataframe. Even if slow, do you have a preference for such a task? `grep`? `match`? – Ben Apr 28 '20 at 13:57
  • Your definition of "identical", then, is not transitive. Two strings may be "identical" to a third one without being directly identical to each other. Consider `CAG` and `AGC`. They are not "identical" by your definition, but both are "identical" to `CAGC`. Will this be a problem in your real-life use of the data, and specifically of this "identification"? –  Apr 28 '20 at 14:36
  • Hey, if you have 6 million it's going to take eternity (almost) to calculate pairwise distances.. You need to provide more information, are the sequences of equal length? What is similar? – StupidWolf Apr 28 '20 at 14:37
  • For example, if you know there is a sequence "CAGCATCATACCAACT" or basically sequences of n-length to collapse towards, you can just count the occurrence of these in the list, and collapse by that – StupidWolf Apr 28 '20 at 14:38
  • How many characters per row? How many total characters? – GolamMazid Sajib Apr 28 '20 at 14:43
  • Thank you everybody so far for all the suggestions and questions. Let me fill in some information and maybe one assumption: @StupidWolf: The sequences are not of equal length. Again, I do have one row and I am looking for something like a partial match in this row. The sequences range between 11 and 212 in character length. In theory e.g., a sequence in one row might be present at the beginning, the end, or somewhere in between a character string of greater length of a different row. (I will add another comment below with an assumption...) – Ben Apr 28 '20 at 16:42
  • I do not want to account for matches of two partial strings. What do I mean by that? Let's say we have two strings like these: `ATCATCGTA` and `CGCATGAC`. There is a match for `CAT` starting at position 3 of both strings, but this is not what I am interested in. I am only interested in cases where one complete character string is present at some position in another character string. I do see there are multiple challenges, since the range in sequence length is so big (11-212). – Ben Apr 28 '20 at 16:46
  • @GolamMazidsajib: The range of characters per row is 11-212. The total number of characters for all strings in the column of the dataframe (`sum(nchar(x))`) is 459,474,777. – Ben Apr 28 '20 at 16:49
  • (I also added some information to the main text > ADDITION OF INFORMATION SECTION) – Ben Apr 28 '20 at 16:57
  • In order to achieve this in this lifetime, you probably need a much more sophisticated algorithm than what you are suggesting at the moment, and probably should step away from R. You can achieve something like this by breaking your sequences into _k-mers_ of a shorter length and building a hash map of `k_mer->[sequences]`. This will help to greatly reduce the number of candidates you have to check. You can then exploit the fact that only longer sequences need to be checked, to further reduce that. Alternative options are e.g., suffix arrays – Bastian Schiffthaler Apr 28 '20 at 18:24
  • In addition, there are many situations that are not trivial to resolve. A sequence could be a sub-sequence of multiple parents, which are different, or it could have parents, grandparents, and great-grand-parents... You will very likely end up with a somewhat complex graph that won't be all too easy to resolve. Maybe it's better to ask *why* you are trying to do this. To me, it sounds like you are writing the first step of an assembler, which is maybe left to software like SPADES or Trinity _etc._ – Bastian Schiffthaler Apr 28 '20 at 18:36
  • Thanks @BastianSchiffthaler for your response. Why am I trying to do this? I would have to dig into biology to explain this. Basically I have a table of clones characterized by their DNA sequence coming from two different "reportings" (two different data frames). They weren't created by me - I am just dealing with them. In order to compare/merge these two data frames, I have to identify which ones are identical. For the clones in df2 the reporting of clones was changed in a way that *more* nucleotides were reported per clone, so... a *perfect match scenario* would fail. – Ben Apr 28 '20 at 19:06
  • Alright, in that case, a k-mer hash-map approach would probably work fast and provide the outcome you want. You can hash the sequence and clone id of your first dataset, and then check your second dataset against the first one to produce the best hit. Chances are there will be sequences that match multiple clones equally well so you won't be able to resolve them this way. If this is e.g. Illumina data, is there no barcode info for the sample ID? That would make things much easier for you – Bastian Schiffthaler Apr 28 '20 at 19:37
  • Hi @BastianSchiffthaler. Thanks much for your response. The dataset is not coming from Illumina. This is coming from a company which processed the files. They only reported csv files with information for every clone. I do not have access to the real sequencing files. My idea regarding the analysis is to create a *clone super table* containing all clones from all samples and assign a *clone ID*, then mapping this clone ID back to the "raw" data frames. I am not familiar with this k-mer hash-map approach but I will google it. If you have a good resource in mind, please let me know. Thanks! – Ben Apr 28 '20 at 20:15
  • You asked about approaches. `grep` should be fast enough, as you are looking for fixed-string matches. What is really slow is a regexp engine, but you don't need one here. Then - I don't know how `grep` is implemented, but for fixed-string matching I do hope it uses hashing (an idea that has been suggested here). If it's acceptable to only check initial substrings (meaning, anchored at the left end of the strings), that should be even more reasonable. The `grep` solution is trivial to write. I can also simulate this in Oracle database to compare running times. –  Apr 29 '20 at 03:34
  • Do you have a feel for a "reasonable" distribution of lengths among your 6 million rows, and an expected number of "matches"? Do you expect, for example, that about 1/3 of all rows will find a longer string that is a "match" - and that in most cases, "matches" will only come in pairs, not in groups of three or more? Any such a priori info may be helpful. In any case, for a one-time job I expect that it should be doable. –  Apr 29 '20 at 03:36
  • Are shorter strings expected to be present in full within a longer string to be considered a match? E.g., if we have "GCATTTGAC" and "CATG", would that be a match for "CAT", or no match because "CATG" doesn't appear in full in the first string? – iod Apr 29 '20 at 13:12
  • @mathguy: Thanks for the comment. I will look into the `grep` options here. I cannot make such a priori assumptions unfortunately. Based on the broad range and the limited number of possible characters (DNA only has A, C, T, and G as an option) I would assume there is substantial overlap. BUT: @iod 's comment is important! I am only interested in counting matches as matches if **a short but complete** string composed of A, T, C, and G characters is found at the **beginning of a long(er)** string (so the longer one has an extension which the shorter one is missing). – Ben Apr 29 '20 at 19:48
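The k-mer hash-map idea from the comments, specialized to the prefix-only case, can be sketched like this (illustrative only; the IDs, data layout, and awk bucketing are made up for this sketch): group sequences by their first k characters, with k equal to the minimum sequence length (11 here), so that full comparisons are only needed between sequences sharing that anchor k-mer.

```shell
# Sketch (assumed data layout: "id sequence" per line). Sequences are
# bucketed by their first k characters; only sequences in the same bucket
# can be in a prefix relationship, which shrinks the number of checks.
printf '%s\n' \
    '10030 TAAGAGTGATGATAGATAG' \
    '10044 TAAGAGTGATGATAGATAGATGCAATGA' \
    '10023 ATCACCGTTACAGATCG' \
    | awk -v k=11 '{
          key = substr($2, 1, k)          # k-mer anchored at position 1
          if (key in ids) ids[key] = ids[key] "," $1
          else            ids[key] = $1
      }
      END { for (key in ids) print key ": " ids[key] }'
```

Here `10030` and `10044` land in the same bucket (`TAAGAGTGATG`) while `10023` is alone, so only the first pair needs a full comparison.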

1 Answer


Here is something I put together for the easier problem (finding sequences that are the initial substring of other sequences). The more general problem can be solved similarly, it's just messier and will take (much) longer. For the next step, I plan to simulate your data (create a text file with six million strings of lengths in the range you mentioned) and test the solution to see how long it takes. Then, as I said, I will try the same thing in Oracle database to see if there's a huge difference. The "general problem" is a reasonable project only if the "easy problem" runs in reasonable time.

I assume the data has some sort of identifier for the sequences (an id would be natural in a database). You will see that in the input file, which I show below. You will also see the format of the output - for each shorter sequence that is an initial substring of a longer one, I show both sequences (and their ids). NOTE - a shorter sequence, like ACTAGC, can be the initial substring of multiple longer strings, like ACTAGCTA and ACTAGCAGCA. My output only shows one longer sequence, not all of them.

In principle, the algorithm is trivial. Sort all strings alphabetically, then check each string only against the next one. If it is not an initial substring of the next string, then it cannot be an initial substring of any other string in the dataset. The rest is the implementation in bash.

With n sequences of length at most k, sorting them alphabetically is O(kn log n), and checking each string against the next is O(kn) - which is why this has a chance of working on your 6 million rows.

INPUT FILE:

$ cat input_file
10010 ACATAAGAGTGATGATAGATAGATGCAGATGACAGATG
10011 ATAGAGATGAGACAGATGACAGAAGATAGATAGAGCAGATAG
10013 ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGAC
10015 ACAGATAGCAGATAGACAGA
10016 ACAGATGACAGAAGATAGATAGA
10018 TAAGAGTGATGATAGATAGATGCAGA
10023 ATCACCGTTACAGATCG
10024 GTGATGATAGATAGATGCAGATGACAGATG
10025 ATAGAGTAGAGAGAGAGT
10030 TAAGAGTGATGATAGATAG
10044 TAAGAGTGATGATAGATAGATGCAATGA

EDIT - THE BASH SCRIPT BELOW IS PRETTY BRAIN-DEAD, APOLOGIES TO ALL. AT THE END OF THIS ANSWER I WILL SHOW THE RIGHT WAY TO DO THIS: USING SED, NOT A SHELL LOOP AND THE READ COMMAND.

BASH SCRIPT File name: dupes.sh

#!/bin/bash
# Sort by the sequence (field 2): a sequence that is an initial substring
# of another then appears directly above it, so each line only has to be
# compared with the line that follows it.
sort -k 2 input_file |
{
    read -r key1 seq1
    while read -r key2 seq2
    do
        # Is seq1 an initial substring of seq2?
        if [[ $(expr substr "$seq2" 1 "${#seq1}") == "$seq1" ]]
        then
            echo ""
            echo "$key1 $seq1"
            echo "$key2 $seq2"
        fi
        key1=$key2
        seq1=$seq2
    done
}

(I used echo for output; you would probably want to redirect to a file instead.)

INVOCATION AND OUTPUT

$ ./dupes.sh

10025 ATAGAGTAGAGAGAGAGT
10013 ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGAC

10030 TAAGAGTGATGATAGATAG
10044 TAAGAGTGATGATAGATAGATGCAATGA

EDIT - AS I SAID ABOVE, WHILE THE APPROACH IS RIGHT, THE SCRIPT ABOVE IS TERRIBLE. Here is the right way to do this in bash. This solution takes less than a minute (!!) for the same amount of input data (rather than 80 minutes).

sort -k 2 dna_sequences | sed -nE '{N; /^[^ ]+ ([^ ]+)\n[^ ]+ \1/p; D}'

The output can be redirected to a file, or it can be processed further (for example, I am not adding a newline after each matched pair; this can be done in further processing of the output, or by other means, if necessary).
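For illustration, here is the sed pipeline run against the sample input file from above, with an extra awk stage (my addition, not part of the timed solution) that prints the blank line before each matched pair; sed emits exactly two lines per match, so pairing by line number works.

```shell
# Recreate the sample input file from this answer, then run the sort|sed
# pipeline. The awk stage prints a blank line before each matched pair.
cat > dna_sequences <<'EOF'
10010 ACATAAGAGTGATGATAGATAGATGCAGATGACAGATG
10011 ATAGAGATGAGACAGATGACAGAAGATAGATAGAGCAGATAG
10013 ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGAC
10015 ACAGATAGCAGATAGACAGA
10016 ACAGATGACAGAAGATAGATAGA
10018 TAAGAGTGATGATAGATAGATGCAGA
10023 ATCACCGTTACAGATCG
10024 GTGATGATAGATAGATGCAGATGACAGATG
10025 ATAGAGTAGAGAGAGAGT
10030 TAAGAGTGATGATAGATAG
10044 TAAGAGTGATGATAGATAGATGCAATGA
EOF

sort -k 2 dna_sequences \
    | sed -nE '{N; /^[^ ]+ ([^ ]+)\n[^ ]+ \1/p; D}' \
    | awk 'NR % 2 == 1 { print "" } { print }'
```

This reproduces the two pairs shown earlier: (10025, 10013) and (10030, 10044).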

  • @Ben - I ran the tests in Oracle Database and in Linux (using the solution given here). I used the same algorithm in both - and almost the same functions. I created a table of 6.25 million sequences, of length uniformly distributed between 10 and 220, with 1000 pairs of "matching" strings. The solutions - in both approaches - found exactly those strings; I am certain both solutions are correct. Now: The solution using the `bash` script posted above completed in 80 minutes. The same job in the Oracle database completed in under 5 minutes. The next move is yours. –  May 02 '20 at 02:19
  • Thanks so much for your reply and the testing. Bash might be an option for doing this and I will look into your code (at the end, I do have to re-assign a new ID which reflects that two or more rows are basically identical, just variations of different lengths). You are right that I do have an ID for each row, which serves as a unique identifier for the sequence. – Ben May 04 '20 at 00:37
  • @Ben - I just looked at this again. I wrote a totally stupid `bash` solution, using shell loop and the `read` command. The right way is to pass the result of `sort` to `sed` and let the processing be done on a stream. The result is exactly the same, but the execution time is under one minute (!!!) for 6.25 million input sequences. Let me edit the post to add that solution. –  May 04 '20 at 00:40