How can I select human miRNA from affy chip while analyzing data using R?

Question

I am new to R and want to analyze miRNA expression in a data set with 3 groups. Can anyone help me out?

In my analysis, the top expressed genes included miRNAs from other species that are also on the Affy chip. Now I want to select only the human miRNAs. Please help me.

Thanks in advance

If you are truly new to R and need help with the very basics, work through all of the exercises at http://www.r-tutor.com/. — Alexander, Feb 26 '12 at 16:14

Alexander · Answer 1 · 2012-02-27T12:11:09.227

Summary

I'm not entirely sure what your data frame looks like, given that I haven't worked with Affy chips before. Let me try to summarize what I think you have told us. You have a data frame with a list of all of the microRNAs on the Affy chip, along with their expression data. You want to select the subset of these microRNAs that come from humans.

Possible solution 1

You do not state whether or not your data frame contains a variable that identifies whether or not these microRNAs are indeed from humans. If it does have this information, all you would need to do is subset your data based on this identifier. Type help(subset) or help(Extract) for more information on how to do this.
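As a minimal sketch of this first option, assume a hypothetical data frame `affy` with a `species` column; the column names and values below are made up for illustration and are not taken from your data. If there is no such column but the probe names follow the miRBase naming scheme, in which human entries start with `hsa-`, a pattern match on the name may work instead. Whether your chip's probe names follow that scheme is something you would need to check.

```r
# Toy data: probe names, species labels and expression values are illustrative only
affy <- data.frame(
  probe      = c("hsa-miR-21", "mmu-miR-21", "hsa-let-7a", "rno-miR-1"),
  species    = c("human", "mouse", "human", "rat"),
  expression = c(10.2, 9.8, 11.5, 7.3),
  stringsAsFactors = FALSE
)

# Option A: subset on an explicit species column (assumes such a column exists)
human_by_column <- subset(affy, species == "human")

# Option B: keep rows whose probe name starts with the miRBase human prefix "hsa-"
human_by_prefix <- affy[grepl("^hsa-", affy$probe), ]

print(human_by_prefix)
```

Both options select the same two rows in this toy example. On real data it is worth comparing the number of rows before and after the filter.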

Possible solution 2

If your data frame does not contain such an identifier, you will first need to make a list of all known human microRNAs. You could retrieve these manually from the online miRBase website (and then import them into R), or you could download them from Ensembl using the R package biomaRt. To do the latter, after loading biomaRt, you might type this command:

miRNA <- getBM(c("mirbase_id", "ensembl_gene_id", "start_position", "chromosome_name"), filters = c("with_mirbase"), values = list(TRUE), mart = ensembl)

The above code requests that R download the miRBase identifier, Ensembl gene ID, start position, and chromosome name for every gene in the selected Ensembl dataset that has a miRBase annotation. (Note that you would have to specify the human Ensembl mart in an earlier command, which I have not shown.)

Once you have downloaded this information, you could use a merge command or perhaps a which command to pull the appropriate microRNAs from your Affy chip data.
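A minimal sketch of the merge approach, using two small made-up data frames in place of the real Affy data and the real biomaRt download. It assumes both tables hold the identifier in a column named `mirbase_id` and in exactly the same format; with real chip data the probe names may need cleaning before they match.

```r
# Toy stand-in for the Affy expression data (values are made up)
affy <- data.frame(
  mirbase_id = c("hsa-mir-187", "mmu-mir-187", "hsa-mir-1539", "cel-mir-1"),
  expression = c(8.1, 7.9, 5.4, 6.6),
  stringsAsFactors = FALSE
)

# Toy stand-in for the table of human microRNAs downloaded with getBM()
miRNA <- data.frame(
  mirbase_id      = c("hsa-mir-187", "hsa-mir-1539"),
  chromosome_name = c("18", "18"),
  stringsAsFactors = FALSE
)

# merge() performs an inner join by default, so only IDs present in both tables remain
human <- merge(affy, miRNA, by = "mirbase_id")

print(human)
```

The result keeps the expression values from `affy` and adds the annotation columns from `miRNA`; rows for non-human microRNAs drop out because they have no match.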

Recommendations

This all might sound a bit complicated. If you haven't already, I recommend that you spend some time working through exercises on biomaRt and Bioconductor. Information about these packages, and how to install them, is available at the links below:

Bioconductor, http://www.bioconductor.org/install/
Database mining with biomaRt, http://www.stat.berkeley.edu/~sandrine/Teaching/PH292.S10/Durinck.pdf

You might consider asking for this question to be migrated to Biostar. I think you would get better responses there. Also, consider editing your question to provide more information about your data. Good luck.

Edit to my original answer

In reference to your comment made at 2012-02-26 22:08:02, try the following:

## Load biomaRt package
library(biomaRt)

## Specify which "mart" (i.e., source of genetic data) you want to use
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)

## You can then ask the system what attributes are available for download
listAttributes(ensembl)

      name                      description
58    mirbase_accession         miRBase Accession(s)
59    mirbase_id                miRBase ID(s)
60    mirbase_gene_name         miRBase gene name
61    mirbase_transcript_name   miRBase transcript

Above I have pasted part of the output from the listAttributes() command, which shows the relevant miRBase options. Now you can try the following code:

## Download microRNA data
miRNA <- getBM(c("mirbase_id", "ensembl_gene_id", "start_position", "chromosome_name"), filters = c("with_mirbase"), values = list(TRUE), mart = ensembl)

## Check how much we downloaded
> dim(miRNA)
[1] 715   4

## Peek at the head of our data
> head(miRNA)
      mirbase_id ensembl_gene_id start_position chromosome_name
1 hsa-mir-320c-1 ENSG00000221493       19263471              18
2 hsa-mir-133a-1 ENSG00000207786       19405659              18
3    hsa-mir-1-2 ENSG00000207694       19408965              18
4 hsa-mir-320c-2 ENSG00000212051       21901650              18
5    hsa-mir-187 ENSG00000207797       33484781              18
6   hsa-mir-1539 ENSG00000222690       47013743              18

## Check which chromosomes are contributing to our data
> table(miRNA$chromosome_name)

 1 10 11 12 13 14 15 16 17 18 19  2 20 21 22  3  4  5  6  7  8  9  X 
50 27 26 25 15 59 26 15 35  7 85 23 32  5 16 31 23 30 17 33 27 28 80

Now your challenge will be to use this downloaded data to subset your original Affy data. Again, read the help files for the merge, Extract, and which functions, and give it a try yourself first.
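If you get stuck, here is a rough sketch of the `which` approach on a matrix with one row per probe, which is the shape you would get from the expression values of an ExpressionSet. The matrix, the sample names, and the short ID list are made up for illustration, and the sketch assumes the row names are already in the same format as the `mirbase_id` column of the download.

```r
# Toy expression matrix: rows are probes, columns are samples (values are made up)
expr <- matrix(
  c(8.1, 8.3,
    7.9, 7.7,
    5.4, 5.9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("hsa-mir-187", "mmu-mir-187", "hsa-mir-1539"),
                  c("sample1", "sample2"))
)

# Stand-in for miRNA$mirbase_id from the biomaRt download
human_ids <- c("hsa-mir-187", "hsa-mir-1539", "hsa-mir-1-2")

# Row positions of probes whose name appears in the human list
keep <- which(rownames(expr) %in% human_ids)

# drop = FALSE keeps the result a matrix even if only one row matches
human_expr <- expr[keep, , drop = FALSE]

# IDs in the human list that have no matching row on the chip
missing_ids <- setdiff(human_ids, rownames(expr))

print(human_expr)
print(missing_ids)
```

Checking `missing_ids` is a quick way to spot formatting mismatches: if nearly every ID is reported missing, the two naming schemes probably differ and the row names need cleaning first.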

ExpressionSet (storageMode: lockedEnvironment)
assayData: 20706 features, 55 samples
  element names: exprs
protocolData
  sampleNames: EC1_(miRNA).CEL EC21_(miRNA).CEL ... YC85_(miRNA).CEL (55 total)
  varLabels: ScanDate
  varMetadata: labelDescription
phenoData
  sampleNames: EC1_(miRNA).CEL EC21_(miRNA).CEL ... YC85_(miRNA).CEL (55 total)
  varLabels: sample
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: mirna20
— Shafi, Feb 26 '12 at 19:00
Try pressing the "edit" button above next to your original question to incorporate any new information that you want to add rather than putting it as a comment. — Alexander, Feb 26 '12 at 19:09
Your information is very helpful. As I am totally new (I have only been trying to use R for a couple of weeks), it's very difficult for me. If you can give me some more hints, I will greatly appreciate it. — Shafi, Feb 26 '12 at 19:46
Following your suggestions, I got an error: Error in getBM(c("mirbase", "ensembl_gene_id", "start_position", "chromosome_name"), : Invalid attribute(s): mirbase Please use the function 'listAttributes' to get valid attribute names — Shafi, Feb 26 '12 at 22:08
I have edited my original answer above. Give my new suggestions a try and see how it goes. Good luck. — Alexander, Feb 27 '12 at 13:42
Your suggested script worked very well. Now I need to draw a heat map and do clustering. How should I go about it? Please suggest. — Shafi, Feb 28 '12 at 08:48
`Bioconductor` can be used to draw heat maps. See [here](http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/heatmap/) and [here](http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#R_graphics_heat) for more details. — Alexander, Feb 28 '12 at 13:26
Thanks for your support. Is it necessary to do PCA (for sample clustering)? If so, how should I do it: before or after normalization? — Shafi, Mar 02 '12 at 21:48
Sorry, I'm not entirely sure of the answer to that question. I think your best bet would be to make a new question on biostar.stackexchange.com if after a literature search you are still unsure. If you feel that I answered your previous question please consider selecting my answer above. :-) Good luck! — Alexander, Mar 05 '12 at 20:04