How to summarize several independent variables at once in R?

Question

For example, suppose the data look like this:

Cultivar=rep(c("CV1","CV2"),each=12)
Nitrogen=rep(rep(c("N0","N1","N2","N3"), each=3),2)
Block=rep(c("I","II","III"),8)
Yield=c(99,109,89,115,142,133,121,157,142,125,150,139,82,104,99,117,
        125,127,145,154,154,151,166,175)
Protein=c(25,35,45,55,44,33,21,57,42,25,50,39,72,14,79,71,25,27,45,54,47,51,66,75)
dataA=data.frame(Cultivar,Nitrogen,Block,Yield,Protein)

I'd like to summarize the yield and protein data, so I used the code below.

library(plyr)
dataB=ddply(dataA, c("Cultivar","Nitrogen"), summarise, mean=mean(Yield), 
            sd=sd(Yield), n=length(Yield), se=sd/sqrt(n))
dataC=ddply(dataA, c("Cultivar","Nitrogen"), summarise, mean=mean(Protein), 
            sd=sd(Protein), n=length(Protein), se=sd/sqrt(n))
dataB$Protein=dataC$mean
dataB$Protein_se=dataC$se
dataB

  Cultivar Nitrogen mean        sd n        se  Protein Protein_se
1      CV1       N0   99 10.000000 3  5.773503 35.00000   5.773503
2      CV1       N1  130 13.747727 3  7.937254 44.00000   6.350853
3      CV1       N2  140 18.083141 3 10.440307 40.00000  10.440307
4      CV1       N3  138 12.529964 3  7.234178 38.00000   7.234178
5      CV2       N0   95 11.532563 3  6.658328 55.00000  20.599353
6      CV2       N1  123  5.291503 3  3.055050 41.00000  15.011107
7      CV2       N2  151  5.196152 3  3.000000 48.66667   2.728451
8      CV2       N3  164 12.124356 3  7.000000 64.00000   7.000000

But I believe there is much simpler code to summarize several independent variables at once.

Could you let me know how to do that?

Many thanks,

jpsmith · Accepted Answer · 2023-04-18T20:53:57.127

You could use dplyr::summarize() with across() over the desired columns, specify the groups with .by, and put all the summary statistics you want in a list:

library(dplyr)

dataA %>%
  summarize(across(Yield:Protein, 
                   .fns = list(Mean = mean, 
                               SD = sd, 
                               n = length,
                               se = ~ sd(.x)/sqrt(length(.x)))), 
            .by = c("Cultivar", "Nitrogen"))

Output:

  Cultivar Nitrogen Yield_Mean  Yield_SD Yield_n  Yield_se Protein_Mean Protein_SD Protein_n Protein_se
1      CV1       N0         99 10.000000       3  5.773503     35.00000  10.000000         3   5.773503
2      CV1       N1        130 13.747727       3  7.937254     44.00000  11.000000         3   6.350853
3      CV1       N2        140 18.083141       3 10.440307     40.00000  18.083141         3  10.440307
4      CV1       N3        138 12.529964       3  7.234178     38.00000  12.529964         3   7.234178
5      CV2       N0         95 11.532563       3  6.658328     55.00000  35.679126         3  20.599353
6      CV2       N1        123  5.291503       3  3.055050     41.00000  26.000000         3  15.011107
7      CV2       N2        151  5.196152       3  3.000000     48.66667   4.725816         3   2.728451
8      CV2       N3        164 12.124356       3  7.000000     64.00000  12.124356         3   7.000000

When I used the code you provided, a warning message popped up [Warning message: In names(cols)[missing_names] <- names[missing_names] : number of items to replace is not a multiple of replacement length]. Could you tell me how to solve it? — Jin.w.Kim, Apr 19 '23 at 00:25
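
One possible cause of this warning, not confirmed anywhere in the thread, is an older dplyr installation: the .by argument was only added in dplyr 1.1.0. Below is a minimal sketch of an equivalent call that uses group_by() instead and should also run on dplyr 1.0.x. The object name result is only for illustration, and the data is rebuilt from the question so the snippet stands alone.

```r
library(dplyr)

# Same data as in the question
dataA <- data.frame(
  Cultivar = rep(c("CV1", "CV2"), each = 12),
  Nitrogen = rep(rep(c("N0", "N1", "N2", "N3"), each = 3), 2),
  Block    = rep(c("I", "II", "III"), 8),
  Yield    = c(99, 109, 89, 115, 142, 133, 121, 157, 142, 125, 150, 139,
               82, 104, 99, 117, 125, 127, 145, 154, 154, 151, 166, 175),
  Protein  = c(25, 35, 45, 55, 44, 33, 21, 57, 42, 25, 50, 39,
               72, 14, 79, 71, 25, 27, 45, 54, 47, 51, 66, 75)
)

# Group explicitly instead of relying on .by (assumed unavailable here)
result <- dataA %>%
  group_by(Cultivar, Nitrogen) %>%
  summarise(across(c(Yield, Protein),
                   list(Mean = mean,
                        SD   = sd,
                        n    = length,
                        se   = ~ sd(.x) / sqrt(length(.x)))),
            .groups = "drop")  # return an ungrouped tibble

print(result)
```

If this runs cleanly where the .by version did not, updating dplyr (and checking packageVersion("dplyr")) is probably the simpler long-term fix.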

score 1 · Answer 2 · answered Apr 18 '23 at 20:52

You could simply use:

> summary(dataA[,c("Yield", "Protein")])
     Yield          Protein     
 Min.   : 82.0   Min.   :14.00  
 1st Qu.:113.5   1st Qu.:31.50  
 Median :130.0   Median :45.00  
 Mean   :130.0   Mean   :45.71  
 3rd Qu.:150.2   3rd Qu.:55.50  
 Max.   :175.0   Max.   :79.00

Or, for more detail:

> library("EnvStats")
> summaryFull(dataA[,c("Yield", "Protein")])

                             Protein    Yield
N                            24.0000  24.0000
Mean                         45.7100 130.0000
Median                       45.0000 130.0000
10% Trimmed Mean             45.4000 130.4000
Geometric Mean               41.8800 127.6000
Skew                          0.1883  -0.1933
Kurtosis                     -0.7934  -0.7482
Min                          14.0000  82.0000
Max                          79.0000 175.0000
Range                        65.0000  93.0000
1st Quartile                 31.5000 113.5000
3rd Quartile                 55.5000 150.2000
Standard Deviation           18.2100  24.8500
Geometric Standard Deviation  1.5650   1.2220
Interquartile Range          24.0000  36.7000
Median Absolute Deviation    17.7900  30.3900
Coefficient of Variation      0.3985   0.1912
attr(,"class")
[1] "summaryStats"
attr(,"stats.in.rows")
[1] TRUE
attr(,"drop0trailing")
[1] TRUE

Hope it answers your question.
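
Note that summary() and summaryFull() as called above describe each column over all 24 rows, not per Cultivar and Nitrogen combination as in the question. If per-group statistics are needed without extra packages, a base R sketch along these lines may work. The helper names se and stats and the object res are hypothetical, introduced here only for illustration.

```r
# Same data as in the question
dataA <- data.frame(
  Cultivar = rep(c("CV1", "CV2"), each = 12),
  Nitrogen = rep(rep(c("N0", "N1", "N2", "N3"), each = 3), 2),
  Block    = rep(c("I", "II", "III"), 8),
  Yield    = c(99, 109, 89, 115, 142, 133, 121, 157, 142, 125, 150, 139,
               82, 104, 99, 117, 125, 127, 145, 154, 154, 151, 166, 175),
  Protein  = c(25, 35, 45, 55, 44, 33, 21, 57, 42, 25, 50, 39,
               72, 14, 79, 71, 25, 27, 45, 54, 47, 51, 66, 75)
)

# Hypothetical helpers: standard error and a bundle of statistics
se    <- function(x) sd(x) / sqrt(length(x))
stats <- function(x) c(mean = mean(x), sd = sd(x), n = length(x), se = se(x))

# cbind() on the left-hand side summarizes both variables in one call
res <- aggregate(cbind(Yield, Protein) ~ Cultivar + Nitrogen,
                 data = dataA, FUN = stats)

# aggregate() returns matrix columns; flatten them into ordinary columns
# named Yield.mean, Yield.sd, ..., Protein.se
res <- do.call(data.frame, res)

print(res)
```

The row order differs from the dplyr output because aggregate() varies the first grouping variable fastest.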
