We have 2 DNA sequences (strings):
>1
ATGCAT
135198
>2
ATCAT
Expected output: first, we need to align these 2 strings, then get relevant annotation by index:
ATGCAT
AT-CAT
13-198
First part can be done using Biostrings package:
library(Biostrings)
p <- DNAString("ATCAT")
s <- DNAString("ATGCAT")
s_annot <- "135198"
x <- pairwiseAlignment(pattern = p, subject = s)
aligned(x)
# A DNAStringSet instance of length 1
# width seq
# [1] 6 AT-CAT
as.character(x)
# [1] "AT-CAT"
as.matrix(x)
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "A" "T" "-" "C" "A" "T"
Current workaround for the 2nd part:
annot <- unlist(strsplit(s_annot, ""))
annot[ which(c(as.matrix(x)) == "-") ] <- "-"
# [1] "1" "3" "-" "1" "9" "8"
Works fine, but I was wondering if there is a Biostrings way of doing it (or any other packages), maybe by keeping the annotation in the metadata slot, then after aligning we get the matching annotation for matched bases in metadata, something like below:
getSlots("DNAString")
# shared offset length elementMetadata metadata
# "SharedRaw" "integer" "integer" "DataTable_OR_NULL" "list"
# just an idea, non-working code
s@metadata <- unlist(strsplit(s_annot , ""))
x <- pairwiseAlignment(pattern = p, subject = s)
metadata(x)
# [[1]]
# [1] "1" "3" "-" "1" "9" "8"
Note:
- With author's permission stolen from BioStars - https://www.biostars.org/p/415896/
- Author wanted biopython solution, so adding the tag, post python solution if possible, too.