'position-aware' aligning of sequences with letter annotations

Question

We have 2 DNA sequences (strings):

>1
ATGCAT
135198
>2
ATCAT

Expected output: first, we need to align these 2 strings, then get relevant annotation by index:

ATGCAT
AT-CAT
13-198

First part can be done using Biostrings package:

library(Biostrings)

p <- DNAString("ATCAT")
s <- DNAString("ATGCAT")
s_annot <- "135198"

x <- pairwiseAlignment(pattern = p, subject = s)

aligned(x)
# A DNAStringSet instance of length 1
#     width seq
# [1]     6 AT-CAT
as.character(x)
# [1] "AT-CAT"
as.matrix(x)
#     [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "A"  "T"  "-"  "C"  "A"  "T"

Current workaround for the 2nd part:

annot <- unlist(strsplit(s_annot, ""))
annot[ which(c(as.matrix(x)) == "-") ] <- "-"
# [1] "1" "3" "-" "1" "9" "8"

Works fine, but I was wondering if there is a Biostrings way of doing it (or any other packages), maybe by keeping the annotation in the metadata slot, then after aligning we get the matching annotation for matched bases in metadata, something like below:

getSlots("DNAString")
#      shared              offset              length     elementMetadata            metadata 
# "SharedRaw"           "integer"           "integer" "DataTable_OR_NULL"              "list" 

# just an idea, non-working code
s@metadata <- unlist(strsplit(s_annot , ""))
x <- pairwiseAlignment(pattern = p, subject = s)
metadata(x)
# [[1]]
# [1] "1" "3" "-" "1" "9" "8"

Note:

With author's permission stolen from BioStars - https://www.biostars.org/p/415896/
Author wanted biopython solution, so adding the tag, post python solution if possible, too.

Jaap · Answer 1 · 2020-01-13T08:51:17.873

A possible solution:

dna_fun <- function(s, p, a) {
  s <- strsplit(s, "")[[1]]
  p <- strsplit(p, "")[[1]]
  a <- strsplit(a, "")[[1]]
  ls <- length(s)
  lp <- length(p)

  r <- lapply(c(1,seq(lp)), function(x) {
    v <- rep(1, 5)
    v[x] <- 2
    v
  })

  mat <- sapply(r, rep, x = p)
  tfm <- mat == matrix(rep(s, ls), ncol = ls)
  m <- which.max(colSums(tfm))

  p2 <- mat[, m]
  p2[!tfm[,m]] <- "-"

  a[!tfm[,m]] <- "-"

  p2 <- paste(p2, collapse = "")
  a <- paste(a, collapse = "")

  return(list(p2, a))
}

with:

dna_fun(s1, s2, annot)

you get:

[[1]]
[1] "AT-CAT"

[[2]]
[1] "13-198"

If you have corresponding vectors you can use Map together with the dna_fun-function:

s11 <- c("ATGCAT","ATCGAT")
s22 <- c("ATCAT","ATCAT")
annot2 <- c("135198","145892")

lm <- Map(dna_fun, s11, s22, annot2)

data.table::rbindlist(lm, idcol = "dna")

this gives:

      dna     V1     V2
1: ATGCAT AT-CAT 13-198
2: ATCGAT ATC-AT 145-92

Data:

s1 <- "ATGCAT"
s2 <- "ATCAT"
annot <- "135198"

score 2 · Answer 2 · edited Jan 13 '20 at 08:53

2

As requested, a Biopython solution:

from Bio import Align

p = "ATCAT"
s = "ATGCAT"
s_annot = "135198"

aligner = Align.PairwiseAligner()
alignment = str(aligner.align(p, s)[0]).split()
middle = alignment.pop(1)
alignment.append("".join(c if m == "|" else m for c, m in zip(s_annot, middle)))

print("\n".join(alignment))

Output:

AT-CAT
ATGCAT
13-198

edited Jan 13 '20 at 08:53

Jaap

81,064
34
182
193

answered Jan 11 '20 at 17:22

Chris_Rands

38,994
14
83
119

score 0 · Accepted Answer · answered May 24 '23 at 21:06

Answer from - https://www.biostars.org/p/415896/#9564298

This just popped up again in my inbox, so thought I'd come back with the solution I came up with.

I believe I arrived at the eventual solution based on an idea from Devon Ryan.

It's embedded in to a much more convoluted codebase, but the general idea was this:

Create a pairwise alignment of the sequences as normal.
Iterate through the alignment and create a sort of CIGAR string which captures the various sequence 'movements'.
Apply/map this same set of movements to the list of annotations corresponding to each position, thus repeating the set of operations

Code here:

https://github.com/jrjhealey/ISIS/blob/master/ImmunoRender.py#L35-L103

'position-aware' aligning of sequences with letter annotations

3 Answers3