4

Related to this question here, but I decided to ask another question for the sake of clarity as the 'new' question is not directly related to the original. Briefly, I am using ddply to cumulatively sum a value for each of three years. My code takes data from the first year and repeats in in the second and third-year rows of the column. My guess is that each 1-year chunk is being copied to the whole of the column, but I don't understand why.

Q. How can I get a cumulatively summed value for each year, in the right rows of the designated column?

[Edit: the for loop - or something similar - is important, as ultimately I want to automagically calculate new columns based on a list of column names, rather than calculating each new column by hand. The loop iterates over the list of column names.]

enter image description here

I use the ddply and cumsum combination frequently so it is rather vexing to suddenly be having problems with it.

[Edit: this code has been updated to the solution I settled on, which is based on @Chase's answer below]

require(lubridate)
require(plyr)
require(xts)
require(reshape)
require(reshape2)

set.seed(12345)
# create dummy time series data
monthsback <- 24
startdate <- as.Date(paste(year(now()),month(now()),"1",sep = "-")) - months(monthsback)
mydf <- data.frame(mydate = seq(as.Date(startdate), by = "month", length.out = monthsback),
                   myvalue1 = runif(monthsback, min = 600, max = 800),
                   myvalue2 = runif(monthsback, min = 1900, max = 2400),
                   myvalue3 = runif(monthsback, min = 50, max = 80),
                   myvalue4 = runif(monthsback, min = 200, max = 300))

mydf$year <- as.numeric(format(as.Date(mydf$mydate), format="%Y"))
mydf$month <- as.numeric(format(as.Date(mydf$mydate), format="%m"))

# Select columns to process
newcolnames <- c('myvalue1','myvalue4','myvalue2')

# melt n' cast
mydf.m <- mydf[,c('mydate','year',newcolnames)]
mydf.m <- melt(mydf.m, measure.vars = newcolnames)
mydf.m <- ddply(mydf.m, c("year", "variable"), transform, newcol = cumsum(value))
mydf.m <- dcast(mydate ~ variable, data = mydf.m, value.var = "newcol")
colnames(mydf.m) <- c('mydate',paste(newcolnames, "_cum", sep = ""))
mydf <- merge(mydf, mydf.m, by = 'mydate', all = FALSE)
mydf
Community
  • 1
  • 1
SlowLearner
  • 7,907
  • 11
  • 49
  • 80

1 Answers1

4

I don't really follow your for loop there, but are you overcomplicating things? Can't you just directly use transform and ddply?

#Make sure it's ordered properly
mydf <- mydf[order(mydf$year, mydf$month),]

#Use ddply to calculate the cumsum by year:
ddply(mydf, "year", transform, 
      cumsum1 = cumsum(myvalue1), 
      cumsum2 = cumsum(myvalue2))
#----------
       mydate myvalue1 myvalue2 year month   cumsum1   cumsum2
1  2010-05-01 744.1808 264.4543 2010     5  744.1808  264.4543
2  2010-06-01 775.1546 238.9828 2010     6 1519.3354  503.4371
3  2010-07-01 752.1965 269.8544 2010     7 2271.5319  773.2915
....
9  2011-01-01 745.5411 218.7712 2011     1  745.5411  218.7712
10 2011-02-01 797.9474 268.1834 2011     2 1543.4884  486.9546
11 2011-03-01 606.9071 237.0104 2011     3 2150.3955  723.9650
...
21 2012-01-01 690.7456 225.9681 2012     1  690.7456  225.9681
22 2012-02-01 665.3505 232.1225 2012     2 1356.0961  458.0906
23 2012-03-01 793.0831 206.0195 2012     3 2149.1792  664.1101

EDIT - this is untested as I don't have R on this machine, but this is what I had in mind:

require(reshape2)
mydf.m <- melt(mydf, measure.vars = newcolnames)
mydf.m <- ddply(mydf.m, c("year", "variable"), transform, newcol = cumsum(value))
dcast(mydate + year + month  ~ variable, data = mydf.m, value.var = "newcol")
Chase
  • 67,710
  • 18
  • 144
  • 161
  • Thank you @Chase. I can and do use `ddply` and `transform` directly (and successfully) when I'm dealing with a small static group, say 2-3 columns. Then yesterday I found that I need to do this for 12 data series, leading me to the conclusion that my current approach of direct coding each value just doesn't scale and needs to be rethought. The `for` loop is my attempt to automate the construction of columns holding these yearly running totals (and various other common calculations as well). – SlowLearner May 10 '12 at 15:21
  • @SlowLearner - gotcha. The first thing that came to mind would be to `melt` your data into long format, then do your calculation with `ddply` grouping on `year` and `variable`, then `cast` back into wide format – Chase May 10 '12 at 16:42
  • thanks for the suggestion. I'm having problems conceptualising what you mean. I guess you mean junk the `for` loop... I have tried `mydf <- melt(mydf, id = c('mydate','year','month')) mydf$newcol <- 1 mydf <- ddply(mydf, .(year, variable), transform, newcol = cumsum(value)) colnames(mydf)[colnames(mydf)=="newcol"] <- paste(variable, "_cuml", sep = "", collapse = "") mydf <- cast(mydf, mydate ~ variable + newcol)` which seems to work, except that I can't quite work out the final cast to get `newcol` back to wide format. Would you be kind enough to help? – SlowLearner May 10 '12 at 22:23
  • @SlowLearner - see my edits. I'm not at an R terminal at the moment, but that's what I had in mind! – Chase May 11 '12 at 02:39
  • thank you, that does indeed create a new data frame with two columns containing cumulative sums for the original columns. I think I can take this, rename the new columns and merge it into the original data frame, so this is very helpful. Is there perhaps a way when using dcast to append the new columns to the original data frame rather having it replace the original columns with the newly calculated columns? (I've had a nose through the help files but couldn't see anything obvious.) – SlowLearner May 11 '12 at 07:49