I have some genome positions and I want to annotate them (find the Ensembl gene ID and features such as exonic, intronic, ...) based on Ensembl, using the biomaRt R package.

Part of my data:

  chr       start        stop     strand
chr10   100572320   100572373          -   
chr10   100572649   100572658          +   
Prradep
star

1 Answer

Prepare your data to query biomaRt

sample data

data = data.frame(chr = "chr17", start = 63973115, end = 64437414)
data$query = paste(gsub("chr",'',data$chr),data$start,data$end, sep = ":")

#> data
#    chr    start      end                query
#1 chr17 63973115 64437414 17:63973115:64437414
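
For a table shaped like the one in the question (a `stop` column and a `+`/`-` strand), the same idea could be sketched as below. `make_region_query` is a hypothetical helper, not part of biomaRt, and the optional fourth field assumes the `chromosomal_region` filter accepts strand as `1`/`-1` in a `chr:start:end:strand` string, which is worth checking against your Ensembl release.

```r
# Rows copied from the question's sample data
regions <- data.frame(
  chr    = c("chr10", "chr10"),
  start  = c(100572320, 100572649),
  stop   = c(100572373, 100572658),
  strand = c("-", "+"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one "chr:start:end:strand" string per row.
# Assumption: strand is encoded as 1 / -1 in the filter value.
make_region_query <- function(df) {
  chrom  <- sub("^chr", "", df$chr)               # Ensembl names lack the "chr" prefix
  strand <- ifelse(df$strand == "-", "-1", "1")
  # %d with as.integer() avoids scientific notation such as "1e+08"
  sprintf("%s:%d:%d:%s", chrom, as.integer(df$start), as.integer(df$stop), strand)
}

regions$query <- make_region_query(regions)
regions$query
# "10:100572320:100572373:-1" "10:100572649:100572658:1"
```

The resulting vector can be passed as `values` to `getBM()` in the same way as `data$query`.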

Then use biomaRt

library(biomaRt)

# select your dataset of interest accordingly. 
# I have used human specific dataset identifier
# you can see all available datasets using listDatasets(mart),
# after setting your mart of interest

mart = useMart(
         'ENSEMBL_MART_ENSEMBL', 
          host = 'https://www.ensembl.org',  # full URL form of the Ensembl host
          dataset = 'hsapiens_gene_ensembl')

# run listAttributes(mart) to list all the information you can extract using biomaRt

out = getBM(
        attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 
                       'ensembl_transcript_id', 'ensembl_exon_id'), 
        filters = 'chromosomal_region', 
        values = data$query, 
        mart = mart)

This will give you the Ensembl IDs of the genes, transcripts, and exons present in the given genomic location. biomaRt offers a lot more information, so do not forget to use listAttributes() to see everything that is available.
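
biomaRt does not return an exonic/intronic/intergenic label directly, but one can be derived from exon coordinates. The following is a minimal sketch, not a definitive method. It assumes exon coordinates have been retrieved for the chromosome of interest, for example with attributes such as `exon_chrom_start` and `exon_chrom_end` (check `listAttributes(mart)` for the exact names). The exon table here is made-up toy data, `classify_region` is a hypothetical helper, and strand is ignored.

```r
# Toy exon table standing in for a getBM() result (made-up coordinates,
# all assumed to lie on the same chromosome as the query region)
exons <- data.frame(
  ensembl_gene_id  = c("geneA", "geneA", "geneB"),
  exon_chrom_start = c(100, 500, 2000),
  exon_chrom_end   = c(200, 600, 2300),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label a region as exonic, intronic or intergenic.
classify_region <- function(start, end, exons) {
  if (nrow(exons) == 0) return("intergenic")
  # exonic: the region overlaps at least one exon
  if (any(exons$exon_chrom_start <= end & exons$exon_chrom_end >= start)) {
    return("exonic")
  }
  # gene span taken as the outermost exon boundaries of each gene
  gene_start <- tapply(exons$exon_chrom_start, exons$ensembl_gene_id, min)
  gene_end   <- tapply(exons$exon_chrom_end,   exons$ensembl_gene_id, max)
  # intronic: inside a gene span but overlapping no exon
  if (any(gene_start <= end & gene_end >= start)) return("intronic")
  "intergenic"
}

classify_region(150, 160, exons)    # "exonic"
classify_region(300, 400, exons)    # "intronic"
classify_region(1000, 1500, exons)  # "intergenic"
```

For many regions or whole-genome work, the interval classes in GenomicRanges/GenomicFeatures are likely a better fit than this row-by-row check.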

Veerendra Gadekar
  • Thanks, Veerendra, for your help. But I would also like to know which kind of region these positions fall in (intronic, exonic, or intergenic). I could not find a suitable attribute for it. Can you help me with this? – star Feb 24 '16 at 08:54
  • As I mentioned, use `listAttributes()` to find all the information available to extract. I think you can get the coordinates of the exons lying inside these regions. If not, you can always download the GTF file directly from the Ensembl FTP site and look into it. Another option is the GenomicFeatures library, which lets you build your own database from biomart; hopefully you can find all the information you need from there. – Veerendra Gadekar Feb 24 '16 at 09:33
  • I think you will not find direct annotations for the locations of all the features, so you will have to work around it a bit. For this I find the `GenomicFeatures` library very useful. You can have a look at its manual first. – Veerendra Gadekar Feb 24 '16 at 09:43
  • I had an extra step after `mart <- useMart(xxx)`, which was `set <- useDataset('btaurus_gene_ensembl', mart)`. Then in `getBM(mart = set)` – Nosey Mar 01 '21 at 19:54