Plotting gene expression data with means in a randomized experiment

Question

I'm (a newbie to R) analyzing a randomized study on the effect of two treatments on gene expression. We evaluated 5 different genes at baseline and after 1 year. The gene fold is calculated as the value at 1 year divided by the baseline value.

Example gene: IL10_BL IL10_1Y IL10_fold

Gene expression is measured as a continuous variable, typically ranging from 0.1 to 5.0. 100 patients have been randomized to either a statin or diet regime.

I would like to do the following plot: - Y axis should display the mean gene expression with 95% confidence limit - X axis should be categorical, with the baseline, 1 year and fold value for each of the 5 genes, grouped by treatment. So, 5 genes with 3 values for each gene in two groups would mean 30 categories on the X axis. It would be really nice of the dots for the same gene would be connected with a line.

I have tried to do this myself (using ggplot2) without any success. I've tried to do it directly from the crude data, which looks like this (first 6 observations and 2 different genes):

    genes <- read.table(header=TRUE, sep=";", text = 
    "treatment;IL10_BL;IL10_1Y;IL10_fold;IL6_BL;IL6_1Y;IL6_fold;
    diet;1.1;1.5;1.4;1.4;1.4;1.1;
    statin;2.5;3.3;1.3;2.7;3.1;1.1;
    statin;3.2;4.0;1.3;1.5;1.6;1.1;
    diet;3.8;4.4;1.2;3.0;2.9;0.9;
    statin;1.1;3.1;2.8;1.0;1.0;1.0;
    diet;3.0;6.0;2.0;2.0;1.0;0.5;")

I would greatly appreciate any help (or link to a similar thread) to do this.

score 1 · Accepted Answer · edited May 23 '17 at 12:23

First, you need to melt your data into a long format, so that one column (your X column) contains a categorical variable indicating whether an observation is BL, 1Y, orfold.

(your command creates an empty column you might need to get rid of first: genes$X = NULL)

library(reshape2)
genes.long = melt(genes, id.vars='treatment', value.name='expression')

Then you need the gene and measurement (baseline, 1-year, fold) in different columns (from this question).

genes.long$gene = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 1))
genes.long$measurement = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 2))

And put the measurement in the order that you expect:

genes.long$measurement = factor(genes.long$measurement, levels=c('BL', '1Y', 'fold'))

Then you can plot using stat_summary() calls for the mean and confidence intervals. Use facets to separate the groups (treatment and gene combinations).

ggplot(genes.long, aes(measurement, expression)) + 
stat_summary(fun.y = mean, geom='point') + 
stat_summary(fun.data = 'mean_cl_boot', geom='errorbar', width=.25) +
facet_grid(.~treatment+gene)

genes

You can reverse the order to facet_grid(.~gene+treatment) if you want the top level to be gene instead of treatment.

Plotting gene expression data with means in a randomized experiment

1 Answers1