0

I am working with DNA sequence data in fasta files, and have to work only in R for this project. I do some manipulations using the seqinr package (selecting a subset of sequences, altering the fasta headers etc). For the next stage in the analysis I want to do a multiple sequence alignment, and have used the msa R package. I can get msa working if I import a fasta file, but I'm struggling to move directly within R from the seqinr list object to the Biostrings DNAStringSet object that I have used as input for msa.

Example data - Assume that fasta_file.fasta is a fasta file with contents as follows:

>seq1
ATATATAT
>seq2
CGCGCGCG
>seq3
ATATCGCG
>seq4
ATATATAT

The code I've used is:

# Load packages
library(tidyverse)
library(seqinr)
library(msa)
library(adegenet)
library(bios2mds)
library(ape)
library(ggtree)

# Import sequences (using seqinr)
sequences <- read.fasta("fasta_file.fasta")

# Define sequences for selection
seqs_select <- c("seq1", "seq2", "seq3")

# Select chosen sequences
seq_sub <- sequences[names(sequences) %in% seqs_select]

# Check the number of sequences left
length(sequences) # 4 original
length(seq_sub) # 3 after selection

# Run multiple sequence alignment using msa
seq_alignment <- msa(seq_sub, method="ClustalOmega") # Generates an error message because seq_sub is the wrong input

msa works if I import the fasta file straight away as a Biostrings DNAstringset object:

# Import fasta file directly as Biostrings object
seq_dnastring <- readDNAStringSet("fasta_file.fasta") 
seq_alignment <- msa(seq_dnastring, method="ClustalOmega")

I could save the processed fasta file I make using seqinr and then re-load it using readDNAStringSet, but my question is whether there is a way to convert directly from seq_sub to something that msa can use as input to run an alignment with. ie. to turn seq_sub format into seq_dnastring format. Thanks for the help.

Will Hamilton
  • 357
  • 2
  • 17

1 Answers1

1

Another option is to process your sequences in Biostrings, instead of seqinr. Nonetheless, I think this does the trick

library(Biostrings)
library(seqinr)
FUN = function(x)
    paste(getSequence(x), collapse = "")
as(vapply(sequences, FUN, character(1)), "DNAStringSet")
Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
  • Thanks that works. While you're here... Do you have any advice on SARS-CoV-2 full genome (~30Kb) alignments in R? Using the msa package method="ClustalOmega" takes about ~30 minutes for ~20 genomes, whereas MAFFT only takes seconds for the same data. Is there anything faster within R? Can post as a separate question if that would be better. Thanks! Will. – Will Hamilton Oct 24 '21 at 14:51