2

Given three strings:

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

I would like to do multiple sequence alignment so that I get the following result:

abcd
 bcde
  cdef
a    f
  cd  ghi

Using the msa() function from the msa package I tried:

msa(seq, type = "protein", order = "input", method = "Muscle")

and got the following result:

    aln     names
 [1] ABCD--- Seq1
 [2] -BCDE-- Seq2
 [3] --CD-EF Seq3
 [4] -----AF Seq4
 [5] --CDGHI Seq5
 Con --CD-?? Consensus   

I would like to use this function for sequences that can contain any unicode characters, but already in this example the function gives a warning: invalid letters found. Any ideas?

WJH
  • 539
  • 5
  • 14
  • It's a good question, but your expected output isn't fully specified. What happens if a string contains no letters from within the previous string? What happens if it contains letters that were present in an earlier string but not the one immediately before? Should the order be fixed according to the input vector, or should it be changed to maximize alignment? What should the format of the output be? Should it be printed to screen, returned as a character vector, or a character scalar with new line characters in it? Details matter here. – Allan Cameron May 25 '22 at 12:39
  • Sure, I added the details as far as I could. – WJH May 25 '22 at 19:25
  • You may also consider `mafft --anysymbol`. More info: https://mafft.cbrc.jp/alignment/software/anysymbol.html – Ghoti May 26 '22 at 20:39

2 Answers2

3

Here's a solution in base R that outputs a table:

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

all_chars <- unique(unlist(strsplit(seq, "")))

tab <- t(apply(do.call(rbind, lapply(strsplit(seq, ""), 
       function(x) table(factor(x, all_chars)))), 1,
       function(x) ifelse(x == 1, all_chars, " ")))

We can print the output without quotes to see it more clearly:

print(tab, quote = FALSE)
#>      a b c d e f g h i
#> [1,] a b c d          
#> [2,]   b c d e        
#> [3,]     c d e f      
#> [4,] a         f      
#> [5,]     c d     g h i

Created on 2022-05-25 by the reprex package (v2.0.1)

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • How can I refer to this algorithm when mentioning this in a paper? – WJH May 25 '22 at 20:39
  • I think I would just say that "the sequences were matched by fitting them into a tabular array of all the unique characters in the entire set." – Allan Cameron May 25 '22 at 20:44
  • OK, thanks. Do you maybe also have a version that keeps the order of the letters within a string. Assume we add 'fa'. Then either the a or the f matches with a or f in other strings having a or f, but not both the a and the f can match. Or when having 'afa', the last a will not match with an a in any of the other strinigs. – WJH May 26 '22 at 09:48
  • @WJH how would you know which `a` to match in the 'afa' case? And suppose you had 'fab' - would you want the f to match on its own, with the ab sticking out at the end, or would you want the 'ab' to match, with the f sticking out on the left? – Allan Cameron May 26 '22 at 09:53
  • I would always add so that the number of matching letters is maximized. So in case of 'afa' I would match the first 'a' and the 'f', and the last 'a' sticking out on the right, and in case of 'fab' I would match 'a' and 'b' and sticking out 'f' on the 'left.' – WJH May 26 '22 at 11:30
  • @WJH that's a far harder problem, but I'll have a look at it. – Allan Cameron May 26 '22 at 11:34
  • @AllanCameron before you put too much effort into a more robust solution, let OP try out the MAFFT solution (comment on original question). – Ghoti May 26 '22 at 20:42
1

A solution is to use LingPy. First install LingPy according to the instructions at: http://lingpy.org/tutorial/installation.html. Then run:

library(reticulate)

builtins <- import_builtins()
lingpy   <- import("lingpy")

seqs <- c("mɪlk","mɔˑlkə","mɛˑlək","mɪlɪx","mɑˑlʲk")

multi <- lingpy$Multiple(seqs)
multi$prog_align()
builtins$print(multi)

Output:

m   ɪ   l   -   k   -
m   ɔˑ  l   -   k   ə
m   ɛˑ  l   ə   k   -
m   ɪ   l   ɪ   x   -
m   ɑˑ  lʲ  -   k   -
WJH
  • 539
  • 5
  • 14