1

This is my first time asking a question here and I'm a beginner at R.

I have a huge dataset, where I want to get some overview of the values of multiple columns, based on category:

sampleID|category|element_1|element_2|element_3|element_4|
----------------------------------------------------------
    1   |    A   |  12.53  |   46.17 |   94.09 |  25.23  |
    2   |    B   |  19.53  |   16.17 |   14.09 |  28.23  |
    3   |    C   |  21.53  |   56.17 |   24.09 |  26.23  |
    4   |    D   |  18.53  |   96.17 |   34.09 |  21.23  |
    5   |    B   |  17.53  |   76.17 |   44.09 |  24.23  |
    6   |    A   |  32.53  |   36.17 |   54.09 |  25.23  |

What I've been trying to do is get a mean of each element by each category, what I've been mostly trying are things around tapply function in R:

tapply(data$element1, data$category, mean)

This gives me nice results for one element column, but I cannot seem to find an answer how to do that on all columns, without doing it on each column of elements by hand (mean of element1, element2, element3 etc. by category).

What I want is this:

category | element_1| element_2| element_3 
     A   |   mean   |   mean   |   mean
     B   |   mean   |   mean   |   mean
     C   |   mean   |   mean   |   mean

I've tried versions of apply and aggregate, but cannot get it to work.

Any advice is appreciated, if I need to supply more information, please let me know!

keep_swimming
  • 31
  • 1
  • 6
  • Could you share a sample of your data and the output you want please? I just want to be certain of what you're looking for! – Gainz Oct 23 '19 at 15:58

2 Answers2

2

If you only want to aggregate the columns you can use the dplyr library.

library(dplyr)
df = data.frame(sample_id = c(1,2,3,4),
                category = c("A", "B", "C", "A"),
                element1 = c(1,2,3,4),
                element2 = c(5,6,7,8),
                element3 = c(9,10,11,12))

summarise_if(df, is.numeric, mean)

or equivalent

df %>% summarise_if(is.numeric, mean)

This will apply the function mean to every column that is numeric.

T. Ewen
  • 126
  • 4
1

If you want more information than just the mean, you could look at the summary statistics.

Let's create some sample data:

library(tidyverse)
set.seed(1)

my_data <- as_tibble(matrix(runif(100), ncol = 10,
                            dimnames = list(rows = NULL,
                                            cols = paste0("Var_", 1:10))))

Now, we can see the full summary statistics by just using summary:

summary(my_data)

# Alternatively 
my_data %>%
  summary

You can use the colMeans function from base (or from the matrixStats of Rfast packages).

my_data %>%
  colMeans

If you only want to do it on a subset of your data, you can use the select function

my_data %>%
  select(Var_1, Var_2) %>%
  colMeans

Note that when you use colMeansas above, without first selecting only the numeric variables, it will throw an error. summary will still work without a problem.

EDIT:

Taking your comment into account and re-reading your (updated) question, this might be closer to what you are looking for.

library(tidyverse)

set.seed(1)

data <- tibble(
  sampleID = 1:6,
  category = c("A", "B", "C", "D", "B", "A"),
  element_1 = runif(6)*10,
  element_2 = runif(6)*10,
  element_3 = runif(6)*10,
  element_4 = runif(6)*10
  )

Which gives a dataset that looks like this:

# A tibble: 6 x 6
  sampleID category element_1 element_2 element_3 element_4
     <int> <chr>        <dbl>     <dbl>     <dbl>     <dbl>
1        1 A             4.97     7.80       2.52      5.06
2        2 B             9.93     7.62       4.23      7.16
3        3 C             3.77     6.16       2.02      1.51
4        4 D             4.78     0.510      5.02      4.79
5        5 B             1.67     6.96       3.14      2.58
6        6 A             6.07     9.76       9.99      6.47

Now, we can just make a small change and use the group_by() function

data %>%
  group_by(category) %>%
  summarize_if(is.numeric, mean)

Which will give the desired output:

  category sampleID element_1 element_2 element_3 element_4
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 A             3.5      5.52     8.78       6.26      5.77
2 B             3.5      5.80     7.29       3.69      4.87
3 C             3        3.77     6.16       2.02      1.51
4 D             4        4.78     0.510      5.02      4.79
edsandorf
  • 757
  • 7
  • 17
  • Getting the summary and mean is good and I did that before for each column! The problem is getting a column mean based on value of column category: i.e. mean value of element_1 in category A, category B, and category C! That's why I tried using tapply, but it works only on one column at the time (can't do it for element_1-element_n at the same time). – keep_swimming Oct 24 '19 at 10:33
  • You are right. I took another look at your question and edited the answer. I left the original answer there, but everything after EDIT: should be what you are looking for. If not, please let me know where I am misunderstanding you. – edsandorf Oct 24 '19 at 14:50
  • THANK YOU! This is exactly what I wanted! – keep_swimming Oct 24 '19 at 16:59