I have a dataframe, p4p5, that contains the following columns:

colnames(p4p5)
# "SampleID" "expr" "Gene" "Period" "Consequence" "isPTV"

I've used the aggregate function here to find the median expression per Gene:

p4p5_med <- aggregate(expr ~ Gene, p4p5, median)

However, this results in a dataframe with the columns "expr" and "Gene" only. How can I still retain all the original columns when applying the aggregate function?

UPDATE:

Input (p4p5):

SampleID   expr  Gene        Period  Consequence            isPTV
HSB430    -1.23  ENSG000098  4       upstream_gene_variant  0
HSB321    -0.02  ENSG000098  5       stop_gained            1
HSB296     3.12  ENSG000027  4       upstream_gene_variant  0
HSB201     1.22  ENSG000027  4       intron_variant         0
HSB220     0.13  ENSG000013  6       intron_variant         0

Expected output:

SampleID   expr  Gene        Period  Consequence            isPTV  Median
HSB430    -1.23  ENSG000098  4       upstream_gene_variant  0      -0.625
HSB321    -0.02  ENSG000098  5       stop_gained            1      -0.625
HSB296     3.12  ENSG000027  4       upstream_gene_variant  0       2.17
HSB201     1.22  ENSG000027  4       intron_variant         0       2.17
HSB220     0.13  ENSG000013  6       intron_variant         0       0.13
claudiadast
  • `aggregate()` doesn't return every column by design: the output of this function is the result of an aggregation and you can't combine it with the raw data (even conceptually). If you want the aggregation to be done on every column, you have to specify that explicitly. – 12b345b6b78 Nov 27 '18 at 22:14
  • Please include some example data from `p4p5` in your question. The short answer is: you would need to join the aggregated data back to the original. Or you could use `dplyr` to `group_by`, then `mutate` the data. – neilfws Nov 27 '18 at 22:15
  • So something like the following? `p4p5_med <- p4p5 %>% select(Gene, expr, SampleID, Period, isPTV) %>% group_by(Gene) %>% mutate(Median = median(expr))` I tried this but it gives me the same median value for everything. – claudiadast Nov 27 '18 at 22:27
  • Yes - I wrote my answer before seeing your comment and the output is from your example input. If there are different values for `expr` and > 1 value for `Gene`, the medians by group should be different. – neilfws Nov 27 '18 at 22:33
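
The "join the aggregated data back to the original" route mentioned in the comments could look roughly like the sketch below in base R. The data frame is rebuilt from the example input in the question; the names `p4p5_med` and `p4p5_joined` are only illustrative, and the column types (character IDs, integer `Period` and `isPTV`) are assumptions.

```r
# Example input copied from the question (column types are assumed)
p4p5 <- data.frame(
  SampleID    = c("HSB430", "HSB321", "HSB296", "HSB201", "HSB220"),
  expr        = c(-1.23, -0.02, 3.12, 1.22, 0.13),
  Gene        = c("ENSG000098", "ENSG000098", "ENSG000027",
                  "ENSG000027", "ENSG000013"),
  Period      = c(4L, 5L, 4L, 4L, 6L),
  Consequence = c("upstream_gene_variant", "stop_gained",
                  "upstream_gene_variant", "intron_variant",
                  "intron_variant"),
  isPTV       = c(0L, 1L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# One row per Gene holding the median expression
p4p5_med <- aggregate(expr ~ Gene, data = p4p5, FUN = median)

# Rename so the median does not clash with the original expr column
names(p4p5_med)[names(p4p5_med) == "expr"] <- "Median"

# Join the per-gene medians back onto every original row
p4p5_joined <- merge(p4p5, p4p5_med, by = "Gene")
```

Note that `merge()` sorts the result by the key and moves `Gene` to the first column by default, so the row and column order will differ from the input.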

1 Answer

I'd use dplyr for this:

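library(dplyr)

p4p5 %>% 
  group_by(Gene) %>%                               # one group per gene
  mutate(Median = median(expr, na.rm = TRUE)) %>%  # per-gene median, repeated on each row of the group
  ungroup()                                        # drop the grouping so later steps see all rows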

  SampleID  expr Gene       Period Consequence           isPTV Median
  <chr>    <dbl> <chr>       <int> <chr>                 <int>  <dbl>
1 HSB430   -1.23 ENSG000098      4 upstream_gene_variant     0 -0.625
2 HSB321   -0.02 ENSG000098      5 stop_gained               1 -0.625
3 HSB296    3.12 ENSG000027      4 upstream_gene_variant     0  2.17 
4 HSB201    1.22 ENSG000027      4 intron_variant            0  2.17 
5 HSB220    0.13 ENSG000013      6 intron_variant            0  0.13
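
If dplyr isn't available, base R's `ave()` should give the same per-group column without changing row order. This is only a sketch: the data frame below is rebuilt from the question's example input, and its column types are assumed.

```r
# Example input copied from the question (column types are assumed)
p4p5 <- data.frame(
  SampleID    = c("HSB430", "HSB321", "HSB296", "HSB201", "HSB220"),
  expr        = c(-1.23, -0.02, 3.12, 1.22, 0.13),
  Gene        = c("ENSG000098", "ENSG000098", "ENSG000027",
                  "ENSG000027", "ENSG000013"),
  Period      = c(4L, 5L, 4L, 4L, 6L),
  Consequence = c("upstream_gene_variant", "stop_gained",
                  "upstream_gene_variant", "intron_variant",
                  "intron_variant"),
  isPTV       = c(0L, 1L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each Gene group and returns a vector
# of the same length as expr, aligned with the original rows
p4p5$Median <- ave(
  p4p5$expr, p4p5$Gene,
  FUN = function(x) median(x, na.rm = TRUE)
)
```

Because `ave()` returns one value per input row, all original columns and the original row order are kept as they are.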
neilfws