1

I have a dataframe like this

Step <- c("1","1","4","3","2","2","3","4","4","3","1","3","2","4","3","1","2")
Length <- c(0.1,0.5,0.7,0.8,0.2,0.1,0.3,0.8,0.9,0.15,0.25,0.27,0.28,0.61,0.15,0.37,0.18)
Breadth <- c(0.13,0.35,0.87,0.38,0.52,0.71,0.43,0.8,0.9,0.15,0.45,0.7,0.8,0.11,0.11,0.47,0.28)
Height <- c(0.31,0.35,0.37,0.38,0.32,0.51,0.53,0.48,0.9,0.15,0.35,0.32,0.22,0.11,0.17,0.27,0.38)
Width <- c(0.21,0.25,0.27,0.8,0.2,0.21,0.3,0.28,0.29,0.65,0.55,0.37,0.26,0.31,0.5,0.7,0.8)

df <- data.frame(Step,Length,Breadth,Height,Width) 

I am trying to calculate the max, min, mean, median, standard deviation of the measurements grouped by step and then pivot those columns having the measurements as a column.

My desired output is

  Measurement max_1 min_1 mean_1 median_1       sd_1 max_2 min_2 mean_2 median_2       sd_2 max_3 min_3 mean_3 median_3      sd_3 max_4 min_4 mean_4 median_4       sd_4
       Length  0.50  0.10 0.3050     0.31 0.17058722  0.28  0.10 0.1900    0.190 0.07393691  0.80  0.15  0.334     0.27 0.2693139  0.90  0.61 0.7525    0.750 0.12526638
      Breadth  0.47  0.13 0.3500     0.40 0.15577760  0.80  0.28 0.5775    0.615 0.23012680  0.70  0.11  0.354     0.38 0.2383904  0.90  0.11 0.6700    0.835 0.37567720
       Height  0.35  0.27 0.3200     0.33 0.03829708  0.51  0.22 0.3575    0.350 0.12120919  0.53  0.15  0.310     0.32 0.1570032  0.90  0.11 0.4650    0.425 0.32888701
        Width  0.70  0.21 0.4275     0.40 0.23669601  0.80  0.20 0.3675    0.235 0.28952547  0.80  0.30  0.524     0.50 0.2040343  0.31  0.27 0.2875    0.285 0.01707825

I am trying to do it this way to calculate the summary statistics but its not an efficient way to do it.

library(dplyr)
df1 <- df %>%
  group_by(Step) %>%
  summarise(Length_Mean = mean(Length),
            Breadth_Mean = mean(Breadth),
            Height_Mean = mean(Height),
            Width_Mean = mean(Width))

How do I accomplish my desired output with minimal code and efficiently? Could someone point me in the right direction?

Sharath
  • 2,225
  • 3
  • 24
  • 37

1 Answers1

4

You can use a "scoped" version of summarize to calculate the same summary statistics for multiple columns at once. From ?scoped:

The variants suffixed with _if, _at or _all apply an expression (sometimes several) to all variables within a specified subset. This subset can contain all variables (_all variants), a vars() selection (_at variants), or variables selected with a predicate (_if variants).

Here summarize_all could be a good choice; it selects all columns except for the grouping columns.You can also supply several summary functions to calculate on each of the variables in the selection.

library(tidyverse)

# Calculate the summary statistics
sums <- df %>% 
  group_by(Step) %>% 
  summarize_all(funs(max, min, mean, median, sd))

sums
#> # A tibble: 4 x 21
#>   Step  Length_max Breadth_max Height_max Width_max Length_min Breadth_min
#>   <fct>      <dbl>       <dbl>      <dbl>     <dbl>      <dbl>       <dbl>
#> 1 1           0.5         0.47       0.35      0.7        0.1         0.13
#> 2 2           0.28        0.8        0.51      0.8        0.1         0.28
#> 3 3           0.8         0.7        0.53      0.8        0.15        0.11
#> 4 4           0.9         0.9        0.9       0.31       0.61        0.11
#> # ... with 14 more variables: Height_min <dbl>, Width_min <dbl>,
#> #   Length_mean <dbl>, Breadth_mean <dbl>, Height_mean <dbl>,
#> #   Width_mean <dbl>, Length_median <dbl>, Breadth_median <dbl>,
#> #   Height_median <dbl>, Width_median <dbl>, Length_sd <dbl>,
#> #   Breadth_sd <dbl>, Height_sd <dbl>, Width_sd <dbl>

Now that we have the summary statistics, all that is left to do is to reshape the data to achieve the desired output. For this, gather, spread, separate and unite from tidyr come in handy:

sums %>% 
  # Reshape to long format
  gather(col, val, -Step) %>% 
  # Separate the measurement and the summary statistic
  separate(col, into = c("Measurement", "stat")) %>% 
  arrange(Step) %>% 
  # Create the desired column headings
  unite(col, stat, Step) %>% 
  # Need to use factors to preserve order
  mutate_at(vars(col, Measurement), fct_inorder) %>% 
  # Reshape back to wide format
  spread(col, val)
#> # A tibble: 4 x 21
#>   Measurement max_1 min_1 mean_1 median_1   sd_1 max_2 min_2 mean_2
#>   <fct>       <dbl> <dbl>  <dbl>    <dbl>  <dbl> <dbl> <dbl>  <dbl>
#> 1 Length       0.5   0.1   0.305    0.31  0.171   0.28  0.1   0.19 
#> 2 Breadth      0.47  0.13  0.35     0.4   0.156   0.8   0.28  0.578
#> 3 Height       0.35  0.27  0.32     0.330 0.0383  0.51  0.22  0.358
#> 4 Width        0.7   0.21  0.428    0.4   0.237   0.8   0.2   0.368
#> # ... with 12 more variables: median_2 <dbl>, sd_2 <dbl>, max_3 <dbl>,
#> #   min_3 <dbl>, mean_3 <dbl>, median_3 <dbl>, sd_3 <dbl>, max_4 <dbl>,
#> #   min_4 <dbl>, mean_4 <dbl>, median_4 <dbl>, sd_4 <dbl>

Created on 2018-05-24 by the reprex package (v0.2.0).

Mikko Marttila
  • 10,972
  • 18
  • 31