
Using R's summary(), I want to make a table that has the mean, standard deviation, n, min, and max for multiple variables. I will use mtcars (one of R's built-in datasets) as an example. For a single variable, this works well:

as.data.frame(t(unclass(summary(mtcars$disp))))

The result:

Min. 1st Qu. Median     Mean 3rd Qu. Max.
1 71.1 120.825  196.3 230.7219     326  472

With more than one variable, it doesn't work. I get the same result as above (only the summary for mtcars$disp is shown).

as.data.frame(t(unclass(summary(mtcars$disp,mtcars$hp,mtcars$drat))))

The result (the same as above):

Min. 1st Qu. Median     Mean 3rd Qu. Max.
1 71.1 120.825  196.3 230.7219     326  472

The ideal result should look like this.

Min. 1st Qu. Median     Mean 3rd Qu. Max.
71.1 120.825  196.3 230.7219     326  472
52    96.5    123 146.6875     180  335
2.76    3.08  3.695 3.596563    3.92 4.93

I would like the name of variables too:

Name  Min. 1st Qu. Median     Mean 3rd Qu. Max.
disp  71.1 120.825  196.3 230.7219     326  472
hp    52    96.5    123 146.6875     180  335
drat  2.76    3.08  3.695 3.596563    3.92 4.93

Could you advise? Also, in the last piece of code I have to repeat mtcars$ many times. Is there a way to avoid this?

Thank you.

I asked a similar question before ("R question: how to save summary results into a dataset"), but the suggested code was getting very complicated. I'd like to stick with summary() if possible.

Kaz

2 Answers


You can use dplyr and summarise(), which will output a tidy tibble/data.frame, and you can easily specify which summary stats you want.

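library(dplyr)  # select(), group_by(), summarise(), n(), %>%
library(tidyr)  # gather()

mtcars %>% select(disp, hp, drat) %>%
  gather(k, v) %>% group_by(k) %>%
  summarise(min = min(v), median = median(v), mean = mean(v), max = max(v), n = n())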

# A tibble: 3 x 6
  k        min median   mean    max     n
  <chr>  <dbl>  <dbl>  <dbl>  <dbl> <int>
1 disp   71.1  196.   231.   472       32
2 drat    2.76   3.70   3.60   4.93    32
3 hp     52    123    147.   335       32
kstew

You could use sapply() over the columns to get the summary for each:

cols <- c("disp", "hp", "drat")
t(sapply(mtcars[cols], summary))

#      Min. 1st Qu.  Median       Mean 3rd Qu.   Max.
#disp 71.10 120.825 196.300 230.721875  326.00 472.00
#hp   52.00  96.500 123.000 146.687500  180.00 335.00
#drat  2.76   3.080   3.695   3.596563    3.92   4.93

If you also need the names in a separate column:

summary_df <- data.frame(t(sapply(mtcars[cols], summary)), check.names = FALSE)
summary_df$Name <- rownames(summary_df)
rownames(summary_df) <- NULL

summary_df
#   Min. 1st Qu.  Median       Mean 3rd Qu.   Max. Name
#1 71.10 120.825 196.300 230.721875  326.00 472.00 disp
#2 52.00  96.500 123.000 146.687500  180.00 335.00   hp
#3  2.76   3.080   3.695   3.596563    3.92   4.93 drat
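
The question's ideal table has Name as the first column rather than the last. A minimal sketch of one way to get that ordering, assuming the same `cols` as above (the intermediate name `stats` is only illustrative):

```r
cols <- c("disp", "hp", "drat")

# One row per variable, one column per statistic returned by summary()
stats <- t(sapply(mtcars[cols], summary))

# Put Name first; check.names = FALSE keeps headers such as "1st Qu."
# unchanged, and stringsAsFactors = FALSE keeps Name as character
# on R versions before 4.0.
summary_df <- data.frame(Name = rownames(stats), stats,
                         check.names = FALSE, stringsAsFactors = FALSE)
rownames(summary_df) <- NULL

summary_df
```

Because a character vector is subset from `mtcars` with `mtcars[cols]`, the `mtcars$` prefix does not need to be repeated for each variable.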

To add some additional statistics, we need to create a custom function:

custom_summary <- function(x) {
  c(summary(x), length = length(x), nonmissing = sum(!is.na(x)), 
                sd = sd(x, na.rm = TRUE))
}
t(sapply(mtcars[cols], custom_summary))

#      Min. 1st Qu.  Median       Mean 3rd Qu.   Max. length nonmissing          sd
#disp 71.10 120.825 196.300 230.721875  326.00 472.00     32         32 123.9386938
#hp   52.00  96.500 123.000 146.687500  180.00 335.00     32         32  68.5628685
#drat  2.76   3.080   3.695   3.596563    3.92   4.93     32         32   0.5346787
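
One caveat when the data has missing values: summary() appends an "NA's" entry for a numeric vector that contains NAs, so columns with and without NAs can return vectors of different lengths, and sapply() may then give back a list instead of a matrix. The following is a sketch that sidesteps this by computing n, min, max, mean, and SD directly. The data frame `cars_na` and the function name `n_min_max_mean_sd` are hypothetical, introduced only for illustration.

```r
cols <- c("disp", "hp", "drat")

# Hypothetical copy of mtcars with two hp values blanked out, only to
# show how missing data is handled.
cars_na <- mtcars[cols]
cars_na$hp[c(1, 5)] <- NA

# Compute each statistic explicitly so every column yields the same
# five named values, whether or not it contains NAs.
n_min_max_mean_sd <- function(x) {
  c(n    = sum(!is.na(x)),          # number of non-missing values
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE))
}

result <- t(sapply(cars_na, n_min_max_mean_sd))
result
```

Here n counts only the non-missing values, so it should be lower for hp than for the other two columns.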
Ronak Shah
  • This looks great! – Kaz Jul 17 '19 at 03:41
  • I like this approach because it's very simple. I want to add the number of cases to the table. How do I control what I can add to the summary() results? Thank you! – Kaz Jul 17 '19 at 03:54
  • @Kaz What other cases do you want to add to the output? `summary` only returns the values shown in the answer above. If you need to add something else, you would need to write a custom function for that. – Ronak Shah Jul 17 '19 at 03:58
  • I want to add "the number of rows." The dataset mtcars includes 32 cases, so the result should say n=32. – Kaz Jul 17 '19 at 04:02
  • @Kaz Updated the answer to include `n`. – Ronak Shah Jul 17 '19 at 04:14
  • I added the part that adds n. I get number of rows in the dataset (my dataset that I am analyzing) regardless of missing values. I would like n to be the number of nonmissing values. Could you advise? – Kaz Jul 18 '19 at 01:05
  • @Kaz Yes, see the updated answer with `nonmissing`. Here it shows the same as `length` because `mtcars` has no missing values. – Ronak Shah Jul 18 '19 at 01:17
  • Thank you. I forgot to ask you one more thing. I would like SD (standard deviation) from this. My goal is to create a table that has n, min, max, mean, SD for multiple variables. Thank you. – Kaz Jul 18 '19 at 01:42
  • @Kaz you just need to add another variable in the function to calculate `sd`. See the update. – Ronak Shah Jul 18 '19 at 01:47
  • I'm getting all missing values from this SD function. Is it because my variables have missing values? – Kaz Jul 18 '19 at 02:21
  • @Kaz yes, you need to add `na.rm = TRUE` in `sd`. – Ronak Shah Jul 18 '19 at 02:26