3

I am wondering what would be the best way to add, for example, columns of quantiles to the dataset. I was thinking to use ave() function for that, something like ave(iris$Sepal.Length, iris$Species, FUN=quantile) - but in this case ave() merges values returned by quantile() (which in this case returns 5 values per subset) and cut them for the length of iris...

Thanks in advance for suggestions!

Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
Vasily A
  • 8,256
  • 10
  • 42
  • 76

2 Answers2

4

There are a lot of SO questions on this general topic, recommending various uses of ave(), aggregate(), plyr(), reshape2::cast, or data.table depending on personal preference, readability, compactness, flexibility, speed ... Here's a simple solution with aggregate() that seems to do what you want:

(aa <- aggregate(Sepal.Length~Species,data=iris,quantile))

##      Species Sepal.Length.0% Sepal.Length.25% Sepal.Length.50% Sepal.Length.75%
## 1     setosa           4.300            4.800            5.000            5.200
## 2 versicolor           4.900            5.600            5.900            6.300
## 3  virginica           4.900            6.225            6.500            6.900
##   Sepal.Length.100%
## 1             5.800
## 2             7.000
## 3             7.900

edit: re-reading/based on comment, this is not what you want: you need the summarized values replicated for each row, not just once per group.

Perhaps

merge(iris,aa,by="Species")

although that gives a slightly weird data frame (the last "column" is actually a matrix).

It's a little bit magical, but

merge(iris,with(aa,data.frame(Species,Sepal.Length)))

is better -- it unpacks the weird data frame returned by aggregate() a bit more (the names are still a bit wonky).

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Ben, thanks for rapid answer! I know about aggregate, but it returns a "summary" table without original values while I would like to keep the original table and only add the columns with quantiles (yes, there will be some redundancy in it). – Vasily A Jan 27 '13 at 20:03
  • 1
    variant with `merge` works perfectly (although initially I was trying to manage without it :]). But I should note that recast does not work correctly: if you look on first, say, 15 rows, you'll see that starting from row 11, quantiles values are taken from another subset. So, I will use `merge` variant. Thanks again! – Vasily A Jan 27 '13 at 20:22
  • 1
    P.S. @Ben, could you give me a hint on where I can read more about this "weird" form returned by `aggregate()`? It's not the first time I meet it, so I would like to understand how this matrix-in-a-column happens and how to deal with it in general. I don't want to create separate question as it seems to be something very basic - I just don't know right keywords to find this topic... – Vasily A Jan 27 '13 at 20:56
  • 1
    It would actually make a great question if you can figure out a way to formulate it. I don't know where it's documented, except obliquely in `?data.frame`: `If a list or data frame or matrix is passed to ‘data.frame’ it is as if each component or column had been passed as a separate argument (except for matrices of class "model.matrix" and those protected by "I").` It doesn't normally come up in user-generated data frames ... – Ben Bolker Jan 27 '13 at 21:28
3

With the data.table package:

library(data.table)
dt <- data.table(iris)
dt[, paste0("q", 25*(0:4)) := as.list(quantile(Sepal.Length)), by="Species"]
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
  • there seems to be a typo (missing parentheses?) - last line gives me an error `Error in \`[.data.frame\`(dt, , \`:=\`(paste0("q", 25 * (0:4)), as.list(quantile(Sepal.Length))), : unused argument(s) (by = "Species")` – Vasily A Jan 27 '13 at 20:31
  • oups, my fault: error was caused by viewing the table in RStudio. Thanks for another solution! – Vasily A Jan 27 '13 at 20:41