Aligning Multiple Files in R by Pairwise Alignment

Question

I have 15 protein sequences as fasta format in one file. I have to pairwise align them globally and locally then generate a distance score matrix 15x15 to construct dendrogram.

But when I do, i.e. A sequence is not aligning with itself and I get NA result. Moreover, AxB gives 12131 score but BxA gives NA. Thus R can not construct phylogenetic tree.

What should I do?

I'm using this script for the loop but it reads one way only :

for (i in 1:150) { 
  globalpwa<-pairwiseAlignment(toString(ProtDF[D[1,i],2]) 
                              ,toString(ProtDF[D[2,i],2]),
                              substitutionMatrix = "BLOSUM62",
                              gapOpening = 0,
                              gapExtension = -2,
                              scoreOnly=FALSE,
                              type="global")
  ScoreX[i]<-c(globalpwa@score)   
  nameSeq1[i]<-c(as.character(ProtDF[D[1,i],1]))
  nameSeq2[i]<-c(as.character(ProtDF[D[2,i],1]))
}

Hi iremsnmn, it's hard to imagine you get AxB but not BxA. And should AxB equal BxA? — StupidWolf, Nov 14 '19 at 21:45
@StupidWolf well hi! i have all the sequences in one file but i think the problem is I use D <- combn(15,2) command and it gives 15x14 matrix and I can't fix it. I don't know any other way to get possible combinations for all vs all to generate 15x15 matrix . https://ibb.co/545VTpt <- this is what i get. as you can see each of them skips themselves .. PS: they are subtracted from the max alignment score. — , Nov 14 '19 at 21:56
Ok how did you put them into a matrix? Because you store them under ScoreX, is it a vector and you convert that into a distance matrix? — StupidWolf, Nov 14 '19 at 22:02
no, the other problem is i gather nameseq1-2-scoreX in dataframe.. when i try to convert nameseqs as x and y and z as the filling value i see the NA's. well either way i convert the dataframe to matrix and nothing really changes — , Nov 14 '19 at 22:08
Ok yeah, so the problem is with the conversion. And, you looped from 1:150, shouldn't it be 1:105 ? I can write an example using Biostrings for you.. You just need to modify it, for yours to work — StupidWolf, Nov 14 '19 at 22:13
i didnt copy it from R, a little misspell was made :) that'd be awesome if you explain the problem about conversion as well.. i've been losing my mind for days :)) — , Nov 14 '19 at 22:17

score 2 · Accepted Answer · answered Nov 14 '19 at 22:40

I used an example fasta file, protein sequence of RPS29 in fungi.

ProtDF <-
c(OQS54945.1 = "MINDLKVRKDVEKSKAHCHVKPFGKGSRACERCASHRGHNRKYGMNLCRRCLHTNAKILGFTSFD", 
XP_031008245.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF", 
TVY80688.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVKHR", 
TVY57447.1 = "LPFLKIRVEPARRDNLKPAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCVDAMGTLEPRASSPEL", 
TVY47820.1 = "EPARRDNLKTTIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVK", 
TVY37154.1 = "AISKLNSRPQRPISTTPRKAKHTKSLVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKHR", 
TVY29458.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF", 
TVY14147.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGWIGTLEL", 
`sp|Q6CPG3.1|RS29_KLULA` = "MAHENVWYSHPRKFGKGSRQCRISGSHSGLIRKYGLNIDRQSFREKANDIGFYKYR", 
`sp|Q8SS73.1|RS29_ENCCU` = "MSFEPSGPHSHRKPFGKGSRSCVSCYTFRGIIRKLMMCRRCFREYAGDIGFAIYD", 
`sp|O74329.3|RS29_SCHPO` = "MAHENVWFSHPRKYGKGSRQCAHTGRRLGLIRKYGLNISRQSFREYANDIGFVKYR", 
TPX23066.1 = "MTHESVFYSRPRNYGKGSRQCRVCAHKAGLIRKYGLLVCRQCFREKSQDIGFVKYR", 
`sp|Q6FWE3.1|RS29_CANGA` = "MAHENVWFSHPRRFGKGSRQCRVCSSHTGLIRKYDLNICRQCFRERASDIGFNKYR", 
`sp|Q6BY86.1|RS29_DEBHA` = "MAHESVWFSHPRNFGKGSRQCRVCSSHSGLIRKYDLNICRQCFRERASDIGFNKFR", 
XP_028490553.1 = "MSHESVWNSRPRSYGKGSRSCRVCKHSAGLIRKYDLNLCRQCFREKAKDIGFNKFR"
)

So you got it right to use combn. As you said, you need a distance score matrix for dendrogram, so better to store the values in a matrix. See below, I simply named the matrix after the names of the fasta, and slot in the pairwise values

library(Biostrings)
# you can also read in your file
# like ProtDF = readAAStringSet("fasta")

ProtDF=AAStringSet(ProtDF)

# combination like you want
# here we just use the names
D = combn(names(ProtDF),2)

#make the pairwise matrix
mat = matrix(NA,ncol=length(ProtDF),nrow=length(ProtDF))
rownames(mat) = names(ProtDF)
colnames(mat) = names(ProtDF)

# loop through D

for(idx in 1:ncol(D)){
       i <- D[1,idx]
       j <- D[2,idx]
       globalpwa<-pairwiseAlignment(ProtDF[i], 
                                    ProtDF[j],
                              substitutionMatrix = "BLOSUM62",
                              gapOpening = 0,
                              gapExtension = -2,
                              scoreOnly=FALSE,
                              type="global")
       mat[i,j]<-globalpwa@score
       mat[j,i]<-globalpwa@score
}

# if you need to make diagonal zero
diag(mat) <- 0

# plot
plot(hclust(as.dist(mat)))

thank you! i modified it according to mine but it gives different results than multiple sequence alignment.. but at least got something hold on to :) — , Nov 15 '19 at 06:30
Hi @iremsnmn if this answer solved your question please consider accepting it (https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). Yeah pairwise and multiple alignments would give different results. There are tools in R to do that. — StupidWolf, Nov 15 '19 at 08:52
Unfortunately for SO, i think your reputation is too low and I cannot invite you to a chat room..Let me know if you are interested in doing that in R. — StupidWolf, Nov 15 '19 at 08:53
hi! sorry i couldnt look here :) thaank u for ur help and yes i'm interested in R :) — , Nov 16 '19 at 09:02
I am using R, do you have any recommendations for faster approaches than that used in 'pairwiseAlignment'? What options are there for 'stripped' and maybe hashes or possibly even FFT methods? — Vass, Sep 05 '20 at 02:01

Nick · Answer 2 · 2019-11-20T17:21:49.120

An alternate method, if you're interested, using the same example as above:

library(DECIPHER)

ProtDF <- c(OQS54945.1 = "MINDLKVRKDVEKSKAHCHVKPFGKGSRACERCASHRGHNRKYGMNLCRRCLHTNAKILGFTSFD", 
            XP_031008245.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF", 
            TVY80688.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVKHR", 
            TVY57447.1 = "LPFLKIRVEPARRDNLKPAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCVDAMGTLEPRASSPEL", 
            TVY47820.1 = "EPARRDNLKTTIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVK", 
            TVY37154.1 = "AISKLNSRPQRPISTTPRKAKHTKSLVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKHR", 
            TVY29458.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF", 
            TVY14147.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGWIGTLEL", 
            `sp|Q6CPG3.1|RS29_KLULA` = "MAHENVWYSHPRKFGKGSRQCRISGSHSGLIRKYGLNIDRQSFREKANDIGFYKYR", 
            `sp|Q8SS73.1|RS29_ENCCU` = "MSFEPSGPHSHRKPFGKGSRSCVSCYTFRGIIRKLMMCRRCFREYAGDIGFAIYD", 
            `sp|O74329.3|RS29_SCHPO` = "MAHENVWFSHPRKYGKGSRQCAHTGRRLGLIRKYGLNISRQSFREYANDIGFVKYR", 
            TPX23066.1 = "MTHESVFYSRPRNYGKGSRQCRVCAHKAGLIRKYGLLVCRQCFREKSQDIGFVKYR", 
            `sp|Q6FWE3.1|RS29_CANGA` = "MAHENVWFSHPRRFGKGSRQCRVCSSHTGLIRKYDLNICRQCFRERASDIGFNKYR", 
            `sp|Q6BY86.1|RS29_DEBHA` = "MAHESVWFSHPRNFGKGSRQCRVCSSHSGLIRKYDLNICRQCFRERASDIGFNKFR", 
            XP_028490553.1 = "MSHESVWNSRPRSYGKGSRSCRVCKHSAGLIRKYDLNLCRQCFREKAKDIGFNKFR")

# All pairwise alignments:

# Convert characters to an AA String Set
ProtDF <- AAStringSet(ProtDF)

# Initialize some outputs
AliMat <- matrix(data = list(),
                 ncol = length(ProtDF),
                 nrow = length(ProtDF))

DistMat <- matrix(data = 0,
                  ncol = length(ProtDF),
                  nrow = length(ProtDF))

# loop through the upper triangle of your matrix
for (m1 in seq_len(length(ProtDF) - 1L)) {
  for (m2 in (m1 + 1L):length(ProtDF)) {
    # Align each pair
    AliMat[[m1, m2]] <- AlignSeqs(myXStringSet = ProtDF[c(m1, m2)],
                                  verbose = FALSE)

    # Assign a distance to each alignment, fill both triangles of the matrix
    DistMat[m1, m2] <- DistMat[m2, m1] <- DistanceMatrix(myXStringSet = AliMat[[m1, m2]],
                                                         type = "dist", # return a single value
                                                         includeTerminalGaps = TRUE, # return a global distance
                                                         verbose = FALSE)
  }
}

dimnames(DistMat) <- list(names(ProtDF),
                          names(ProtDF))

Dend01 <- IdClusters(myDistMatrix = DistMat,
                     method = "NJ",
                     type = "dendrogram",
                     showPlot = FALSE,
                     verbose = FALSE)

# A single multiple alignment:

AllAli <- AlignSeqs(myXStringSet = ProtDF,
                    verbose = FALSE)

AllDist <- DistanceMatrix(myXStringSet = AllAli,
                          type = "matrix",
                          verbose = FALSE,
                          includeTerminalGaps = TRUE)

Dend02 <- IdClusters(myDistMatrix = AllDist,
                     method = "NJ",
                     type = "dendrogram",
                     showPlot = FALSE,
                     verbose = FALSE)

Dend01, from all the pairwise alignments:

Dend02, from a single multiple alignment:

Finally, DECIPHER has a function for loading up your alignment in your browser just to look at it, which, if your alignments are huge, can be a bit of a mistake, but in this case (and in cases up to a few hundred short sequences) is just fine:

BrowseSeqs(AllAli)

A side note about BrowseSeqs, for some reason it doesn't behave well with Safari, but it plays just fine with Chrome. Sequences are displayed in the order in which they exist in the aligned string set.

EDIT: BrowseSeqs DOES play fine with safari directly, but it does have a weird issue with being incorporated with HTMLs knitted together with RMarkdown. Weird, but not applicable here.

Another big aside: All of the functions i've used have a processors argument, which is set to 1 by default, if you'd like to get greedy with your cores you can set it to NULL and it'll just grab everything available. If you're aligning very large string sets, this can be pretty useful, if you're doing trivially small things like this example, not so much.

combn is a great function and I use it all the time. However for these really simple set ups I like looping through the upper triangle, but that's just a personal preference.

the alignment alg of DECIPHER produces a result quite different from ClustalOmega from R's msa (HLA genes). also, is there a library that does pairwise alignment with hashes? — Vass, Sep 05 '20 at 01:59
Not being an "alignment" person I can't really describe a lot of where the differences come from, but most of the well known alignment tools have been benchmarked here: https://academic.oup.com/bioinformatics/article/36/1/90/5530966 Also, I assume the very fast aligners use hashes, but that's just a guess. — Nick, Sep 09 '20 at 15:09

Aligning Multiple Files in R by Pairwise Alignment

2 Answers2