5

I'm fairly new to R and I'm trying to use aggregate to perform some time series shaping on a dataframe, per subject and for each metric in my dataset. This works beautifully, but I find that the result isn't in a format that's very easy to use. I'd like to be able to transform the results back into the same format as the original dataframe.

Using the iris dataset as an example:

# Split into two data frames, one for metrics, the other for grouping
iris_species = subset(iris, select=Species)
iris_metrics = subset(iris, select=-Species)
# Compute diff for each metric with respect to its species
iris_diff = aggregate(iris_metrics, iris_species, diff)

I'm just using diff to illustrate that I have a function that shapes the time series, so I get a time series of possibly different length as a result and definitely not a single aggregate value (e.g. mean).

I'd like to transform the result, which seems to be a matrix that has list valued cells to the original "flat" dataframe.

I'm mostly curious about how to manage this with results from aggregate, but I'd be ok with solutions that do everything in plyr or reshape.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
Vince Gatto
  • 415
  • 3
  • 13
  • You talk about times series..and you give an example based on iris?!! you aggregate using diff? what is the rationale..it would be better to use one of the time series packages (zoo,xts,..) !! really I don't understand what do you want to do.( I read your question at least 3 times) – agstudy Mar 01 '13 at 23:21
  • 1
    @agstudy, I *think* the question is a little bit to do with how `aggregate` behaves if you give it a function like `summary` or `fivenum` or something else that will return more than one column. In such cases, the result is what *looks* like a multi-column `data.frame`, but is actually a `matrix` as a column in a `data.frame`. Thus, a `do.call(data.frame, ...)` should do the trick to "flatten" the output. – A5C1D2H2I1M1N2O1R2T1 Mar 02 '13 at 09:54
  • @AnandaMahto thanks. I get your point. – agstudy Mar 02 '13 at 19:39

4 Answers4

2

As you might know, aggregate works on one column at a time. A single value is expected, and odd things happen if you return vectors of length different from 1.

You can split this up with by to get the data (with fewer rows than in iris) and put it back together:

b <- by(iris_metrics, iris_species, FUN=function(x) diff(as.matrix(x)))
do.call(rbind, lapply(names(b), function(x) data.frame(Species=x, b[[x]])))

diff(as.matrix) is used as this does what you want for matrices (but not for data frames). The key point is that the function returns a different number of rows than are in each Species in iris.

Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112
2

The best solution I could think of in this case is data.table:

require(data.table)
dt <- data.table(iris, key="Species")
dt.out <- dt[, lapply(.SD, diff), by=Species]

And if you want a plyr solution, then the idea is basically the same. Split by Species and apply diff to each column.

require(plyr)
ddply(iris, .(Species), function(x) do.call(cbind, lapply(x[,1:4], diff)))
Arun
  • 116,683
  • 26
  • 284
  • 387
1

If you wanted to return some sort of first differences vector with the same length as the input vector, you should be doing so with ave and an anonymous function. Since diff returns a vector of a different length you need to extend it with NA (or a marker of your choosing).

iris_diff = lapply(iris_metrics, 
        function(xx) ave(xx, iris_species, FUN=function(x) c(NA, diff(x) ) )  )
str(iris_diff)
#--------------
List of 4
 $ Sepal.Length: num [1:150] NA -0.2 -0.2 -0.1 0.4 ...
 $ Sepal.Width : num [1:150] NA -0.5 0.2 -0.1 0.5 0.3 -0.5 0 -0.5 0.2 ...
 $ Petal.Length: num [1:150] NA 0 -0.1 0.2 -0.1 ...
 $ Petal.Width : num [1:150] NA 0 0 0 0 0.2 -0.1 -0.1 0 -0.1 ...

If you wanted that as a dataframe just wrap data.frame around it. And it would be a good idea to include the original grouping vector:

iris_diff <- data.frame( Species= iris_species, iris_diff)
str(iris_diff)
#------
'data.frame':   150 obs. of  5 variables:
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Sepal.Length: num  NA -0.2 -0.2 -0.1 0.4 ...
 $ Sepal.Width : num  NA -0.5 0.2 -0.1 0.5 0.3 -0.5 0 -0.5 0.2 ...
 $ Petal.Length: num  NA 0 -0.1 0.2 -0.1 ...
 $ Petal.Width : num  NA 0 0 0 0 0.2 -0.1 -0.1 0 -0.1 ...
IRTFM
  • 258,963
  • 21
  • 364
  • 487
1

Here's what I understand as your problem: With your current method of using aggregate, you get a matrix for the results of "Sepal.Length", "Sepal.Width", and so on.

> str(iris_diff)
'data.frame':   3 obs. of  5 variables:
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
 $ Sepal.Length: num [1:3, 1:49] -0.2 -0.6 -0.5 -0.2 0.5 ...
 $ Sepal.Width : num [1:3, 1:49] -0.5 0 -0.6 0.2 -0.1 0.3 -0.1 -0.8 -0.1 0.5 ...
 $ Petal.Length: num [1:3, 1:49] 0 -0.2 -0.9 -0.1 0.4 ...
 $ Petal.Width : num [1:3, 1:49] 0 0.1 -0.6 0 0 0.2 0 -0.2 -0.3 0 ...

But, in your console, that displays as what looks like a data.frame with 197 columns.

You want to convert "iris_diff" to a data.frame with 197 columns. Here's how you can do that with your existing output (a trick I picked up from @James, here on SO):

do.call(data.frame, iris_diff)

Here are the first few lines of the output when we view the str of that action:

> str(do.call(data.frame, iris_diff))
'data.frame':   3 obs. of  197 variables:
 $ Species        : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
 $ Sepal.Length.1 : num  -0.2 -0.6 -0.5
 $ Sepal.Length.2 : num  -0.2 0.5 1.3
 $ Sepal.Length.3 : num  -0.1 -1.4 -0.8
 $ Sepal.Length.4 : num  0.4 1 0.2
 $ Sepal.Length.5 : num  0.4 -0.8 1.1
 $ Sepal.Length.6 : num  -0.8 0.6 -2.7
 $ Sepal.Length.7 : num  0.4 -1.4 2.4
 $ Sepal.Length.8 : num  -0.6 1.7 -0.6
 $ Sepal.Length.9 : num  0.5 -1.4 0.5
 $ Sepal.Length.10: num  0.5 -0.2 -0.7
Community
  • 1
  • 1
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485