How to remove gaps or free spaces in a FASTA format argument in R?

Question

I have a DNA sequence

x<-"CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC
TGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACACACACGTGCT
TACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTT
TACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATG
CACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTAT
CCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAACTGCCCTTAAT
ACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTC
AATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTTCAAC
AATAATACATAAACATATTGGCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAA
AGTTCCTCAATATTGCAATTTGCTTGAACGGATGCTATTTCAGAATATTTCGTACTTACA
CAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTATTCACCGAGC
AATAATACGGTAGTGGCTCAAACTCATGCGGGTGCTATGATACAATTATATCTTATTTCC
ATTCCCATATGCTAACCGCAATATCCTAAAAGCATAACTGATGCATCTTTAATCTTGTAT
GTGACACTACTCATACGAAGGGACTATATCTAGTCAAGACGATACTGTGATAGGTACGTT
ATTTAATAGGATCTATAACGAAATGTCAAATAATTTTACGGTAATATAACTTATCAGCGG
CGTATACTAAAACGGACGTTACGATATTGTCTCACTTCATCTTACCACCCTCTATCTTAT
TGCTGATAGAACACTAACCCCTCAGCTTTATTTCTAGTTACAGTTACACAAAAAACTATG
CCAACCCAGAAATCTTGATATTTTACGTGTCAAAAAATGAGGGTCTCTAAATGAGAGTTT
GGTACCATGACTTGTAACTCGCACTGCCCTGATCTGCAATCTTGTTCTTAGAAGTGACGC
ATATTCTATACGGCCCGACGCGACGCGCCAAAAAATGAAAAACGAAGCAGCGACTCATTT
TTATTTAAGGACAAAGGTTGCGAAGCCGCACATTTCCAATTTCATTGTTGTTTATTGGAC"

I want to pick substring from this for which i used following code.

library(stringi)
library(stringr)

stri_sub(x, from=335, to=649)

Expected output should start from ATG but it is not. Please help me in solving this. Moreover argument could be of length 50k or more. Also, is there any way to convert this fasta file into a string. Please excuse me for my language i am new to R.

Maybe you want to use Bioconductor [Biostrings](https://bioconductor.org/packages/Biostrings) for working with fasta files (via `readDNAStringSet()`) ? — Martin Morgan, Jun 16 '18 at 07:01
Cannot reproduce: `x <- gsub("[^ATCG]", "", x); substr(x, start = 335, stop = 649)`. Maybe there are characters in the string other than "ATCG". — Rui Barradas, Jun 16 '18 at 07:17

score 1 · Answer 1 · answered Jun 16 '18 at 09:22

I'm not sure why you're loading both stringr and stringi. Either way, the issue with extracting the substring are the linebreaks ("\n") in x. You can simply remove them with gsub.

Option 1 using `Biostrings`

Using the Biostrings library is the preferred option here, especially if you need to do further work with DNA sequences. Biostrings (and the other Bioconductor libraries) have been developed to make most of the typical operations involving DNA/RNA sequence data a lot easier.

library(Biostrings);
extractAt(BString(gsub("\n", "", x)), IRanges(335, 649));
#A BStringSet instance of length 1
#  width seq
#[1]   315 ATGATCGTAAATAACACACACGTGCTTACCCTAC...ATATCTCATTCGGCGGTCCCAAATATTGTATAA

A side note: If in fact your sequence data is in a FASTA file, you can simply read in the FASTA file with readDNAStringSet which will automatically take care of line breaks. Take a look at ?readDNAStringSet for details.

Option 2 using `stringr::str_sub`

stringr::str_sub(gsub("\n", "", x), start = 335, end = 649);
#[1] "ATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTCCACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAA"

Option 3 using `stringi::stri_sub`

stringi::stri_sub(gsub("\n", "", x), from = 335, to = 649);
#[1] "ATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTCCACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAA"

score 1 · Answer 2 · answered Mar 11 '21 at 22:58

How to remove gaps or free spaces or ambiguous characters in a DNAstringset or AAStringSet or RNAstringetset in R.

An example with AAstringset with ambiguous character X that needs to be removed.

Load packages in R

library(deepredeff)#converts AAstringset to dataframe

library(DECIPHER)# provides BrowseSeq, align functions

library("Biostrings")#provides AAStringSet,DNAStringSet
Read FASTA file into R

seq <- readAAStringSet("CDS.txt") #load sequences to R

Browse the sequences to see how they look

`BrowseSeqs(seq, highlight=0)`#browse

To remove ambiguous characters such as X in my case, first convert the seq AAstringSet to seq dataframe.

seqdf <- aasset_to_df(seq)# convert to data frame

seqdf1 <- seqdf[-grep("X", seqdf$seq),]# Remove all sequences that have X in their sequences

seqdf1 <- seqdf[!grepl("X", seqdf$seq),]#also works
Now that you have removed unwanted character X from that data frame, convert the columns 'name' and 'seq' to separate lists of values.

names <- seqdf1$name#names values

seq <- seqdf1$seq#seq values

Then convert back the seq values to AAstringSet

seqAA <- AAStringSet (seq)

Then attach back their names by reassigning their names from the list of values 'names'.

names(seqAA) <- names

Finally you are done, so you may want to browse the seqAA to check.

BrowseSeqs(seqAA, highlight=0)

From the clean Amino acid AAstringSet, in my case, you can do further filtering as you want e.g. You can extract genes of your interest or exclude gene that you do not want or even to align them.

seqAA1 <- seqAA[grepl("gene=S", seqAA@ranges@NAMES)]#get gene of interest that matches portion 'gene=S'.

BrowseSeqs(seqAA1, highlight=0)#Browse to check.

seqAA2 <- seqAA[!grepl("gene=S", seqAA@ranges@NAMES)]#Exlude gene that you do not want that matches portion 'gene=S'.

BrowseSeqs(seqAA2, highlight=0)#Browse to check.

seqAA3 <- seqAA[grep(1270, seqAA@ranges@width)]# get sequences that have 1270 amino acids.

BrowseSeqs(seqAA3, highlight=0)#Browse to check.

seq3 <- seqAA[!grep(1270, seqAA@ranges@width)]# Exclude sequences that have 1270 amino acids.

BrowseSeqs(seqAA3, highlight=0)#Browse to check.

alignedseqAA3 <- AlignSeqs(AAStringSet(seqAA3))#align

BrowseSeqs(alignedseqAA3, highlight=0)#Browse to check.

Best regards,

Evans

A. Suliman · Answer 3 · 2018-06-16T06:56:27.397

0

Using Base R

substr(x, 1, 3)
"CCA"

Type ?substr for more details. You can check here on how to import fasta format into R

edited Jun 16 '18 at 06:56

answered Jun 16 '18 at 06:46

A. Suliman

12,923
5
24
37

How to remove gaps or free spaces in a FASTA format argument in R?

3 Answers3

Option 1 using Biostrings

Option 2 using stringr::str_sub

Option 3 using stringi::stri_sub

Option 1 using `Biostrings`

Option 2 using `stringr::str_sub`

Option 3 using `stringi::stri_sub`