How to convert from seqinr SeqFastadna object to Biostrings DNAStringSet for multiple sequence alignment in R

Question

I am working with DNA sequence data in fasta files, and have to work only in R for this project. I do some manipulations using the seqinr package (selecting a subset of sequences, altering the fasta headers etc). For the next stage in the analysis I want to do a multiple sequence alignment, and have used the msa R package. I can get msa working if I import a fasta file, but I'm struggling to move directly within R from the seqinr list object to the Biostrings DNAStringSet object that I have used as input for msa.

Example data - Assume that fasta_file.fasta is a fasta file with contents as follows:

>seq1
ATATATAT
>seq2
CGCGCGCG
>seq3
ATATCGCG
>seq4
ATATATAT

The code I've used is:

# Load packages
library(tidyverse)
library(seqinr)
library(msa)
library(adegenet)
library(bios2mds)
library(ape)
library(ggtree)

# Import sequences (using seqinr)
sequences <- read.fasta("fasta_file.fasta")

# Define sequences for selection
seqs_select <- c("seq1", "seq2", "seq3")

# Select chosen sequences
seq_sub <- sequences[names(sequences) %in% seqs_select]

# Check the number of sequences left
length(sequences) # 4 original
length(seq_sub) # 3 after selection

# Run multiple sequence alignment using msa
seq_alignment <- msa(seq_sub, method="ClustalOmega") # Generates an error message because seq_sub is the wrong input

msa works if I import the fasta file straight away as a Biostrings DNAstringset object:

# Import fasta file directly as Biostrings object
seq_dnastring <- readDNAStringSet("fasta_file.fasta") 
seq_alignment <- msa(seq_dnastring, method="ClustalOmega")

I could save the processed fasta file I make using seqinr and then re-load it using readDNAStringSet, but my question is whether there is a way to convert directly from seq_sub to something that msa can use as input to run an alignment with. ie. to turn seq_sub format into seq_dnastring format. Thanks for the help.

score 1 · Accepted Answer · answered Oct 23 '21 at 19:13

1

Another option is to process your sequences in Biostrings, instead of seqinr. Nonetheless, I think this does the trick

library(Biostrings)
library(seqinr)
FUN = function(x)
    paste(getSequence(x), collapse = "")
as(vapply(sequences, FUN, character(1)), "DNAStringSet")

answered Oct 23 '21 at 19:13

Martin Morgan

45,935
7
84
112

Thanks that works. While you're here... Do you have any advice on SARS-CoV-2 full genome (~30Kb) alignments in R? Using the msa package method="ClustalOmega" takes about ~30 minutes for ~20 genomes, whereas MAFFT only takes seconds for the same data. Is there anything faster within R? Can post as a separate question if that would be better. Thanks! Will. – Will Hamilton Oct 24 '21 at 14:51

How to convert from seqinr SeqFastadna object to Biostrings DNAStringSet for multiple sequence alignment in R

1 Answers1