Summary Statistics in R

Question

How do I generate summary statistics (mean, sd, range, sample size) for multiple categories (different measurements across row 1) from different species (in column 1) simultaneously and have them written to one data file using write.csv()? I can do so easily enough one species at a time, but I would like to place the data from all the species in one .csv file and generate the summary statistics all at once.

Welcome to StackOverflow! Please have a quick read of [how to ask](http://stackoverflow.com/help/how-to-ask) and check out [how to make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Then, you could come back and edit your question, adding an example and some code to show what you tried and anything else that helps to clarify your question. — Jota, Feb 20 '17 at 20:59

xyz123 · Answer 1 · 2017-02-21T23:06:52.847

I know what you are talking about. Say you want the mean, standard deviation, range, and sample size. The function range() is awkward here because it does not return a single number but a pair, the smallest and largest values in the dataset, so I left it out below. The magic is in tapply(). I just used the transpose t() and as.matrix() to make the results easier to put into a data frame.

Anyway, take a look at the built in iris dataset.

data(iris)

I am going to give you the mean, sd, and sample size for all these with respect to Sepal Length only, write all values to rows of a dataframe with rbind, and then finally give the rows names with rownames().

Just do this:

# Per-species mean, sd, and sample size of Sepal.Length, each as a 1 x 3 matrix
mean_sepal_length <- t(as.matrix(tapply(iris$Sepal.Length, iris$Species, mean)))
mean_sepal_length

sd_sepal_length <- t(as.matrix(tapply(iris$Sepal.Length, iris$Species, FUN = sd)))
sd_sepal_length

sample_size_sepal_length <- t(as.matrix(tapply(iris$Sepal.Length, iris$Species, FUN = length)))
sample_size_sepal_length

# Start the data frame with the means, then append the other statistics as rows
df_sepal_length <- data.frame(mean_sepal_length)
df_sepal_length

View(df_sepal_length)  # opens the data viewer (interactive sessions)

df_sepal_length <- rbind(df_sepal_length, sd_sepal_length)
df_sepal_length <- rbind(df_sepal_length, sample_size_sepal_length)

rownames(df_sepal_length) <- c("Mean_sepal_length", "sd_sepal_length", "size_sepal_length")

write.csv(df_sepal_length, "C:/Users/me/Documents/tapply_miracle.csv")
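The range is not included above because range() returns two values per group rather than one. A minimal sketch of one way around that, assuming you are happy to store the minimum and maximum as two separate rows: split the measurements by species and apply range() to each piece. The name range_sepal_length is only illustrative.

```r
# split() gives one numeric vector per species; sapply() applies range()
# to each and binds the two-element results into a 2 x 3 matrix
# (one column per species).
range_sepal_length <- sapply(split(iris$Sepal.Length, iris$Species), range)

# range() returns c(min, max), so label the rows accordingly.
rownames(range_sepal_length) <- c("min_sepal_length", "max_sepal_length")

range_sepal_length
```

Because the columns are the species names, these two rows line up with the mean, sd, and sample size rows built with tapply().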

Thanks a lot. I can get all those data for each species separately, but when I have data for multiple species in the same data matrix (.csv file) I'd like to process them all at once rather than cut the matrix up into species-specific data matrices to run separately. Is there any script for that? — L. Grismer, Feb 22 '17 at 01:02

score 0 · Answer 2 · answered Jun 24 '17 at 06:08

I was thinking about the answer I gave a while back and realized it could have been better: the tapply function can accept a list as its INDEX argument. In my earlier example, I only knew that tapply could group by one factor, but we can specify several. The trick is to melt the iris data frame from wide into long form with melt(), which makes it easier to work with, and then call tapply with a list argument:

> install.packages("reshape2")
> library(reshape2)

# I used melt to reshape the iris data frame from wide to long, turning the many columns into rows with fewer columns, and I coerced the result back to a data frame.

> iris_melt <- data.frame(melt(data = iris, id = "Species", variable.name = "iris_factors", value.name = "iris_dimensions_cm"))

> head(iris_melt)
  Species iris_factors iris_dimensions_cm
1  setosa Sepal.Length                5.1
2  setosa Sepal.Length                4.9
3  setosa Sepal.Length                4.7
4  setosa Sepal.Length                4.6
5  setosa Sepal.Length                5.0
6  setosa Sepal.Length                5.4

Here we will get the mean flower dimensions of all the iris factors: Sepal Length, Sepal Width, Petal Length, & Petal Width across all Species (setosa, virginica, versicolor).

> tapply(X = iris_melt$iris_dimensions_cm, INDEX = list(iris_melt$Species, iris_melt$iris_factors), FUN = mean)
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

If we change the order of the factors in the INDEX list, we get the same information in a slightly different format, with the rows and columns flipped:

> tapply(X = iris_melt$iris_dimensions_cm, INDEX = list(iris_melt$iris_factors, iris_melt$Species), FUN = mean)
             setosa versicolor virginica
Sepal.Length  5.006      5.936     6.588
Sepal.Width   3.428      2.770     2.974
Petal.Length  1.462      4.260     5.552
Petal.Width   0.246      1.326     2.026

Getting the standard deviation is easy. Just change the FUN argument:

> tapply(X = iris_melt$iris_dimensions_cm, INDEX = list(iris_melt$iris_factors, iris_melt$Species), FUN = sd)
                setosa versicolor virginica
Sepal.Length 0.3524897  0.5161711 0.6358796
Sepal.Width  0.3790644  0.3137983 0.3224966
Petal.Length 0.1736640  0.4699110 0.5518947
Petal.Width  0.1053856  0.1977527 0.2746501

Now I basically don't have to use rbind.
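To get every statistic for every species and measurement into a single .csv file, the tapply calls can be run in a loop and collected into one table. This is only a sketch under a few assumptions: it reshapes iris with base R instead of melt() so it has no package dependency, the names iris_long, stats, summary_table, and out_file are illustrative, and it writes to a temporary file rather than a real path.

```r
# Wide to long in base R: one row per (Species, measurement, value).
measure_cols <- setdiff(names(iris), "Species")
iris_long <- data.frame(
  Species     = rep(iris$Species, times = length(measure_cols)),
  measurement = factor(rep(measure_cols, each = nrow(iris)), levels = measure_cols),
  value       = unlist(iris[measure_cols], use.names = FALSE)
)

# The statistics to compute; min and max stand in for the range.
stats  <- list(mean = mean, sd = sd, min = min, max = max, n = length)
groups <- list(iris_long$Species, iris_long$measurement)

# One row per Species x measurement combination; expand.grid() varies
# Species fastest, which matches the column-major order of the tapply matrix.
summary_table <- expand.grid(
  Species     = levels(iris_long$Species),
  measurement = levels(iris_long$measurement)
)

# Each tapply() returns a Species x measurement matrix; flatten it into a column.
for (stat_name in names(stats)) {
  stat_matrix <- tapply(iris_long$value, groups, stats[[stat_name]])
  summary_table[[stat_name]] <- as.vector(stat_matrix)
}

# Write everything to one file (a temporary file here; use your own path).
out_file <- tempfile(fileext = ".csv")
write.csv(summary_table, out_file, row.names = FALSE)
```

The result has one row per species and measurement with a column for each statistic, so adding another statistic only means adding another entry to the stats list.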
