3

We have 2 DNA sequences (strings):

>1
ATGCAT
135198
>2
ATCAT

Expected output: first, we need to align these 2 strings, then get relevant annotation by index:

ATGCAT
AT-CAT
13-198

First part can be done using Biostrings package:

library(Biostrings)

p <- DNAString("ATCAT")
s <- DNAString("ATGCAT")
s_annot <- "135198"

x <- pairwiseAlignment(pattern = p, subject = s)

aligned(x)
# A DNAStringSet instance of length 1
#     width seq
# [1]     6 AT-CAT
as.character(x)
# [1] "AT-CAT"
as.matrix(x)
#     [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "A"  "T"  "-"  "C"  "A"  "T" 

Current workaround for the 2nd part:

annot <- unlist(strsplit(s_annot, ""))
annot[ which(c(as.matrix(x)) == "-") ] <- "-"
# [1] "1" "3" "-" "1" "9" "8"

Works fine, but I was wondering if there is a Biostrings way of doing it (or any other packages), maybe by keeping the annotation in the metadata slot, then after aligning we get the matching annotation for matched bases in metadata, something like below:

getSlots("DNAString")
#      shared              offset              length     elementMetadata            metadata 
# "SharedRaw"           "integer"           "integer" "DataTable_OR_NULL"              "list" 

# just an idea, non-working code
s@metadata <- unlist(strsplit(s_annot , ""))
x <- pairwiseAlignment(pattern = p, subject = s)
metadata(x)
# [[1]]
# [1] "1" "3" "-" "1" "9" "8"

Note:

  • With author's permission stolen from BioStars - https://www.biostars.org/p/415896/
  • Author wanted biopython solution, so adding the tag, post python solution if possible, too.
Chris_Rands
  • 38,994
  • 14
  • 83
  • 119
zx8754
  • 52,746
  • 12
  • 114
  • 209

3 Answers3

2

A possible solution:

dna_fun <- function(s, p, a) {
  s <- strsplit(s, "")[[1]]
  p <- strsplit(p, "")[[1]]
  a <- strsplit(a, "")[[1]]
  ls <- length(s)
  lp <- length(p)

  r <- lapply(c(1,seq(lp)), function(x) {
    v <- rep(1, 5)
    v[x] <- 2
    v
  })

  mat <- sapply(r, rep, x = p)
  tfm <- mat == matrix(rep(s, ls), ncol = ls)
  m <- which.max(colSums(tfm))

  p2 <- mat[, m]
  p2[!tfm[,m]] <- "-"

  a[!tfm[,m]] <- "-"

  p2 <- paste(p2, collapse = "")
  a <- paste(a, collapse = "")

  return(list(p2, a))
}

with:

dna_fun(s1, s2, annot)

you get:

[[1]]
[1] "AT-CAT"

[[2]]
[1] "13-198"

If you have corresponding vectors you can use Map together with the dna_fun-function:

s11 <- c("ATGCAT","ATCGAT")
s22 <- c("ATCAT","ATCAT")
annot2 <- c("135198","145892")

lm <- Map(dna_fun, s11, s22, annot2)

data.table::rbindlist(lm, idcol = "dna")

this gives:

      dna     V1     V2
1: ATGCAT AT-CAT 13-198
2: ATCGAT ATC-AT 145-92

Data:

s1 <- "ATGCAT"
s2 <- "ATCAT"
annot <- "135198"
Jaap
  • 81,064
  • 34
  • 182
  • 193
2

As requested, a Biopython solution:

from Bio import Align

p = "ATCAT"
s = "ATGCAT"
s_annot = "135198"

aligner = Align.PairwiseAligner()
alignment = str(aligner.align(p, s)[0]).split()
middle = alignment.pop(1)
alignment.append("".join(c if m == "|" else m for c, m in zip(s_annot, middle)))

print("\n".join(alignment))

Output:

AT-CAT
ATGCAT
13-198
Jaap
  • 81,064
  • 34
  • 182
  • 193
Chris_Rands
  • 38,994
  • 14
  • 83
  • 119
0

Answer from - https://www.biostars.org/p/415896/#9564298


This just popped up again in my inbox, so thought I'd come back with the solution I came up with.

I believe I arrived at the eventual solution based on an idea from Devon Ryan.

It's embedded in to a much more convoluted codebase, but the general idea was this:

  1. Create a pairwise alignment of the sequences as normal.
  2. Iterate through the alignment and create a sort of CIGAR string which captures the various sequence 'movements'.
  3. Apply/map this same set of movements to the list of annotations corresponding to each position, thus repeating the set of operations

Code here:

https://github.com/jrjhealey/ISIS/blob/master/ImmunoRender.py#L35-L103

zx8754
  • 52,746
  • 12
  • 114
  • 209