how to extract genes from genbank file in R

Question

I ask this question because I don't really know how to do it.

I have a genome in a gb format (YJ016_I.gb) so I want to import in R and then export all the genes in nucleotide format, or just take one of the sequence using the name of the gene.

library(genbankr)
library(stringr)
library(purrr)




gb <- genbankr::readGenBank("YJ016_I.gb")
GENES <- GenomicFeatures::genes(gb)
GenesDF <- data.frame(GENES)

  seqnames start  end width strand type gene     locus_tag old_locus_tag loctype pseudo gene_synonym       gene_id
1        I   353  787   435      - gene mioC BJE04_RS00275  VV0001, ....  normal  FALSE           NA BJE04_RS00275
2        I   835 2196  1362      - gene mnmE BJE04_RS00280  VV0002, ....  normal  FALSE           NA BJE04_RS00280
3        I  2314 3933  1620      - gene yidC BJE04_RS00285  VV0003, ....  normal  FALSE           NA BJE04_RS00285
4        I  3936 4193   258      - gene yidD BJE04_RS23545                normal  FALSE           NA BJE04_RS23545     
5        I  4160 4516   357      - gene rnpA BJE04_RS00290  VV0004, ....  normal  FALSE           NA BJE04_RS00290
6        I  4530 4670   141      - gene rpmH BJE04_RS00295  VV0005, ....  normal  FALSE           NA BJE04_RS00295

if I want to extract the sequence of BJE04_RS00275 in locus_tag, or in the other way export all genes of the genbank file

I mean

>BJE04_RS00275
aatgc
>BJE04_RS00280
ggcta
>BJE04_5000
atggcaa

how can I do it with R or if you have any solution in any language o program !!!

Thanks

score 1 · Accepted Answer · answered Jul 10 '22 at 13:09

1

I do not have access to your specific file, so I am using the example file provided by the genbankr package. You will need Biostrings to write the sequence as fasta file. To only write the sequence(s) of a particular locus_tag, just subset GENES to that particular one (e.g. subset(GENES, locus_tag == "BJE04_RS00275") in your example).

library(genbankr)
library(GenomicFeatures)
library(Biostrings)

gb <- readGenBank(system.file("sample.gbk", package="genbankr"))
GENES <- genes(gb)
res <- getSeq(gb@sequence, setNames(GENES, GENES$gene_id))
writeXStringSet(res, "YJ016_I_genes.fa")

answered Jul 10 '22 at 13:09

user12728748

8,106
2
9
14

Thanks so much, it work so good, just like I need, I was looking so much a simple way to do it, I really appreciate your help !!! Thanks !!! – abraham Jul 11 '22 at 05:49
just one more question, if I want to get the proteins, I did it as: E <- exons(gb); x <- AAStringSet(E$translation); names(x) <- E$gene_id .... But i think it easier than that, thanks !!! – abraham Jul 11 '22 at 06:38
1

For proteins, again using the example `gb`, you could probably also do `writeXStringSet(setNames(gb@cds$translation, gb@cds$protein_id), "YJ016_I_proteins.fa")`. – user12728748 Jul 11 '22 at 07:32

how to extract genes from genbank file in R

1 Answers1

Linked