2

I have thousands of DNA sequences that look like this :).

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

I need to extract every sequence between the CTACG and CAGTC. However, many cases in these sequences come with an error (deletion, insertion, substitution). Is there any way to account for mismatches based on Levenshtein distance?

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

qdapRegex::ex_between(ref, "CTACG", "CAGTC")
#> [[1]]
#> [1] "GTTATGTACGATTAAAGAAGATCGT"
#> 
#> [[2]]
#> [1] "CGTTGATATTTTGCATGCTTACTCC"
#> 
#> [[3]]
#> [1] NA

reprex()
#> Error in reprex(): could not find function "reprex"

Created on 2021-12-18 by the reprex package (v2.0.1)

Like this I would be able to extract the sequence also in the last case.

UPDATE: can I create a dictionary with a certain Levenshtein distance and then match it to each sequence?

jay.sf
  • 60,139
  • 8
  • 53
  • 110
LDT
  • 2,856
  • 2
  • 15
  • 32
  • 1
    See [Position of Approximate Substring Matches in R](https://stackoverflow.com/questions/31843171/). – Wiktor Stribiżew Dec 18 '21 at 14:03
  • Try blast or clustalW. Or [msa](https://bioconductor.org/packages/devel/bioc/vignettes/msa/inst/doc/msa.pdf) – Andre Wildberg Dec 18 '21 at 14:03
  • 2
    uff, not quite sure R in itself is the optimal tool here, without some bioinformatics extensions, as @AndreWildberg recommended. QDAP doesn't seem to me to be the tool of choice. – Marcus Müller Dec 18 '21 at 14:07

1 Answers1

2

Using aregexec, build a regex pattern with sprintf, and finally removing the matches using gsub. Putting it into a Vectorized function to avoid overloading the script with lapplys or loops.

In the regex, the .* refers to everything before (resp. after) the respective letters. Note, that you probably need to adapt the max.distance= with your real data.

fu <- Vectorize(function(x) {
  p1 <- regmatches(x, aregexec('.*CTACG', x, max.distance=0.1))
  p2 <- regmatches(x, aregexec('CAGTC.*', x, max.distance=0.1))
  gsub(sprintf('%s|%s', p1, p2), '', x, perl=TRUE)
})

fu(ref)
# CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC 
#          "GTTATGTACGATTAAAGAAGATCGT"          "CGTTGATATTTTGCATGCTTACTCC" 
# CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC 
#          "CGTTGATATTTTGCATGCTTACTCC" 

Data:

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")
jay.sf
  • 60,139
  • 8
  • 53
  • 110