
I need to add several new columns to a data frame, each calculated from one of a subset of its existing columns. These columns all hold time series data (there is a common date column). For example, I need to calculate the change from the same month in the previous year for a dozen columns. I could specify and calculate them individually, but that becomes onerous with a large number of columns to transform, so I am trying to automate the process with a for loop.

I was doing OK until I tried to use ddply to create a column for the running total of the value for the year so far. What happens is that ddply is adding new rows during each iteration through the loop and including those new rows in the cumsum calculation. I have two questions.

Q. How can I get ddply to calculate the correct cumsum?
Q. How can I specify the name of the column during the ddply call, rather than using a dummy value and renaming afterward?

[Edit: I spoke too soon, the updated code below does NOT work at this point, just FYI]

require(lubridate)
require(plyr)
require(xts)

set.seed(12345)
# create dummy time series data
monthsback <- 24
startdate <- as.Date(paste(year(now()),month(now()),"1",sep = "-")) - months(monthsback)
mydf <- data.frame(mydate = seq(as.Date(startdate), by = "month", length.out = monthsback),
                   myvalue1 = runif(monthsback, min = 600, max = 800),
                   myvalue2 = runif(monthsback, min = 200, max = 300))

mydf$year <- as.numeric(format(as.Date(mydf$mydate), format="%Y"))
mydf$month <- as.numeric(format(as.Date(mydf$mydate), format="%m"))
newcolnames <- c('myvalue1','myvalue2')

for (i in seq_along(newcolnames)) {
    print(newcolnames[i])
    mydf$myxts <- xts(mydf[, newcolnames[i]], order.by = mydf$mydate)
    ## Calculate change over same month in previous year
    mylag <- 12
    mydf[, paste(newcolnames[i], "_yoy", sep = "", collapse = "")] <- as.numeric(diff(mydf$myxts, lag = mylag)/ lag(mydf$myxts, mylag))
    ## Calculate change over previous month
    mylag <- 1
    mydf[, paste(newcolnames[i], "_mom", sep = "", collapse = "")] <- as.numeric(diff(mydf$myxts, lag = mylag)/ lag(mydf$myxts, mylag))

    ## Calculate cumulative figure
    #mydf$newcol <- as.numeric(mydf$myxts)
    mydf$newcol <- 1
    mydf <- ddply(mydf, .(year), transform, newcol = cumsum(as.numeric(mydf$myxts)))
    colnames(mydf)[colnames(mydf)=="newcol"] <- paste(newcolnames[i], "_cuml", sep = "", collapse = "")

}

mydf
SlowLearner
  • I am still struggling with this. The ddply call above is cumulatively summing data, but only for the first year, so that the rows in cumulatively summed data column for the second and third years contain the cumsum data for the first year. In effect, the first year is being copied to subsequent years. Can anybody help? – SlowLearner May 10 '12 at 13:58

1 Answer


In your loop, since myxts is not part of the data frame, it is not split up in the ddply statement along with everything else. Change it to:

mydf$myxts <- xts(mydf[, newcolnames[i]], order.by = mydf$mydate)

I don't know of any way to use dynamically generated names with transform.
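One way to sidestep both problems is to leave transform out entirely. The following is a minimal sketch in base R, not the code from the question: it assumes the running total should restart at the start of each calendar year, uses ave() to apply cumsum within each year, and builds the column name with paste0(). The dummy data and the name valuecols are made up for illustration.

```r
# Illustrative monthly data spanning two calendar years (values are made up)
mydf <- data.frame(
  mydate   = seq(as.Date("2010-07-01"), by = "month", length.out = 12),
  myvalue1 = 1:12,
  myvalue2 = seq(10, 120, by = 10)
)
mydf$year <- as.numeric(format(mydf$mydate, "%Y"))

# Hypothetical name for the vector of columns to process
valuecols <- c("myvalue1", "myvalue2")

for (col in valuecols) {
  # ave() runs cumsum separately within each year and returns the
  # results in the original row order, so no rows are added or moved
  mydf[[paste0(col, "_cuml")]] <- ave(mydf[[col]], mydf$year, FUN = cumsum)
}

mydf
```

Because the new column is assigned with `[[<-`, its name can be any string built at run time, so no dummy column or renaming step is needed.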

Brian Diggs
  • thanks for the response. I have moved the xts object into the data frame, but now I get: `Error in data.frame(list(mydate = c(14730, 14761, 14791, 14822, 14853, : arguments imply differing number of rows: 8, 24` which seems to be related to the `newcol =` assignment. – SlowLearner May 10 '12 at 10:37
  • This was resolved by coercing the xts object into a numeric before passing it to ddply. – SlowLearner May 10 '12 at 13:02
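For readers who want to stay with ddply, the sketch below shows one possible shape of the call, assuming the value column is already plain numeric. It passes a function instead of transform, so the cumsum is computed on each year's piece rather than on the whole data frame (an expression such as `mydf$myxts` inside the call always refers to all rows, which would explain the "differing number of rows" error). The data and the argument name piece are illustrative assumptions.

```r
library(plyr)

# Illustrative data: two years, three rows each (values are made up)
mydf <- data.frame(
  year     = rep(c(2010, 2011), each = 3),
  myvalue1 = c(1, 2, 3, 10, 20, 30)
)

col <- "myvalue1"

# ddply splits mydf by year and calls the function once per piece;
# the column name is built dynamically inside the function
mydf <- ddply(mydf, .(year), function(piece) {
  piece[[paste0(col, "_cuml")]] <- cumsum(piece[[col]])
  piece
})

mydf
```

Wrapped in the original for loop, col would take each value of newcolnames in turn.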