I am looking for a solution to the following problem:
I have a data frame with over 6 million rows, where each row holds one DNA sequence. Because of the way the dataset was reported, there will be duplicated rows in the data frame. BUT: these duplicates are not perfect matches. Let me show this with an example.
row 1: ATCTCAGCATCATACCAACTACTA
...
row 5: ATCTCAGCATCATA..........
The previous block shows two sequences in two different rows of the data frame. The dots are just shown for visualization purposes (they are not part of the dataset).
The goal is to mark these sequences as identical. Ultimately, I want to assign a sequence ID to each row, so these two rows should get the same sequence ID: the sequence in row 5 is part of the sequence in row 1, and thus the sequences are potentially identical.
I tried base R's match function and made some attempts with grep, but these approaches are all very slow, if they do not fail outright.
I also tried the Biostrings functions for "Matching a dictionary of patterns against a reference", but I already fail at the step of creating the dictionary, apparently because the sequences in the rows differ greatly in length.
(Error message from Biostrings:)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr, :
element 2 in Trusted Band has a different length than first element
Does anyone have an idea how to achieve this? Again, the challenge is the size of the data frame: with more than 6 million rows, a naive approach means testing each row against every other row.
Thanks much for any feedback! This is really appreciated!
ADDITION OF INFORMATION: Would there be a feasible way if the following assumption holds? Only matches at the beginning of a string are of interest, and at least one of the two strings has to match over its complete length. In other words: the complete sequence of one row can be found at the beginning of the sequences in one or more other rows.
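To make that prefix assumption concrete, here is a minimal sketch in base R of what I have in mind. It is only a sketch under stated assumptions: the function name assign_prefix_ids, the column names sequence and seq_id, and the toy data are all made up for illustration. The idea is to sort the distinct sequences in byte order, so that a sequence is directly followed by all sequences that start with it, and then walk through the sorted vector once instead of comparing every row against every other row. Note one design choice: all sequences that begin with the same shortest complete sequence end up in one group, even if two longer ones diverge from each other later on.

```r
# Toy data: rows 1 and 5 mirror the example above, the other rows are made up.
df <- data.frame(
  sequence = c("ATCTCAGCATCATACCAACTACTA", "GGCATTA", "GGCA",
               "TTTACG", "ATCTCAGCATCATA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: gives the same ID to every sequence that starts with
# the same shortest complete sequence (the "root" of its group).
assign_prefix_ids <- function(seqs) {
  # Radix sort orders character vectors in the C locale (byte order), so a
  # sequence is directly followed by all sequences that begin with it.
  u <- sort(unique(seqs), method = "radix")
  ids <- integer(length(u))
  root <- NA_character_
  current <- 0L
  for (i in seq_along(u)) {
    # Open a new group unless u[i] begins with the current group's root.
    if (is.na(root) || !startsWith(u[i], root)) {
      root <- u[i]
      current <- current + 1L
    }
    ids[i] <- current
  }
  # Map the group IDs back onto the original row order.
  ids[match(seqs, u)]
}

df$seq_id <- assign_prefix_ids(df$sequence)
print(df)
```

In this toy example rows 1 and 5 receive the same seq_id, as do the two made-up GGCA rows. The IDs follow the sorted order of the sequences, not the order in which they first appear in the data frame.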