1

I think this is a split-apply-combine problem, but with a time series twist. My data consists of irregular counts and I need to perform some summary statistics on each group of counts. Here's a snapshot of the data:

enter image description here

And here's it is for your console:

library(xts)

date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
              "2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
              "2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
              "2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
              "2011-03-26", "2011-03-27"))

returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
         -0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
         0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
          0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
          0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)

xtsData <- xts(cbind(returns,count,maxCount,sumCount),date)

I have no idea how to construct the max and cumSum columns, especially since each count series is of an irregular length. Since I won't always know the start and end points of a count series, I'm lost at trying to figure out the index of these groups. Thanks for your help!

UPDATE: here is my for loop for attempting to calculating cumSum. it's not the cumulative sum, just the returns necessary, i'm still unsure how to apply functions to these ranges!

xtsData <- cbind(xtsData,mySumCount=NA)
# find groups of returns
for(i in 1:nrow(xtsData)){
  if(is.na(xtsData[i,"count"]) == FALSE){
    xtsData[i,"mySumCount"] <- xtsData[i,"returns"]
  }
  else{
   xtsData[i,"mySumCount"] <- NA
  }
}

UPDATE 2: thank you commenters!

# report returns when not NA count
x1 <- xtsData[!is.na(xtsData$count),"returns"]

# cum sum is close, but still need to exclude the first element
# -0.009 in the first series of counts and .027 in the second series of counts
x2 <- cumsum(xtsData[!is.na(xtsData$count),"returns"]) 

# this is output is not accurate because .03 is being displayed down the entire column, not just during periods when counts != NA. is this just a rounding error?
x3 <- max(xtsData[!is.na(xtsData$count),"returns"]) 

enter image description here

enter image description here

SOLUTION:

# function to pad a vector with a 0
lagpad <- function(x, k) {
  c(rep(0, k), x)[1 : length(x)] 
}

# group the counts
x1 <- na.omit(transform(xtsData, g =  cumsum(c(0, diff(!is.na(count)) == 1))))

# cumulative sum of the count series
z1 <- transform(x1, cumsumRet = ave(returns, g, FUN =function(x) cumsum(replace(x, 1, 0))))
# max of the count series
z2 <- transform(x1, maxRet = ave(returns, g, FUN =function(x) max(lagpad(x,1))))



 merge(xtsData,z1$cumsumRet,z2$maxRet)

enter image description here

jonnie
  • 745
  • 1
  • 13
  • 22
  • thank you @DavidArenburg, but what you see there is all hard coded. input by hand. i need help calculating maxCount and sumCount – jonnie Aug 17 '14 at 18:08
  • 1
    What is the input? the output? how is the output calculated from the input? What defines a "count series"? Also, show what you have tried. – G. Grothendieck Aug 17 '14 at 18:13
  • @G.Grothendieck OK, give me a minute to input my for-loop that i was using to calculate a the sumCount. the maxCount I have no idea. to clarify inputs and outputs, the input is the count data. a count series is a range of counts, bounded by the NA's. so in the example above, there are 2 count series, one that is 1:7 and the other is 1:2. the output calculated is based off the "returns" column, but conditionally so only during periods where the count time-series is not NA. – jonnie Aug 17 '14 at 18:27
  • @G.Grothendieck posted my for-loop which might help you understand the inputs better. – jonnie Aug 17 '14 at 19:01
  • Ok, do you need help calculating the count too? Or is the count accurate? – mgriebe Aug 17 '14 at 19:07
  • It is not clear what you are trying to do. Your for loop is just copying the returns column into a mySumCount column, skipping NAs. You could accomplish the same by doing this: xtsData[!is.na(xtsData$count),"mySumCount2"]<-xtsData[!is.na(xtsData$count),"returns"] – mgriebe Aug 17 '14 at 19:19
  • So what do you want the sumCount to do? add up the returns? – mgriebe Aug 17 '14 at 19:24
  • 1
    xtsData[!is.na(xtsData$count),"mySumCount2"]<-cumsum(xtsData[!is.na(xtsData$count),"returns"]) – mgriebe Aug 17 '14 at 19:25
  • 1
    calculate the max: xtsData[!is.na(xtsData$count),"myMax2"]<-max(xtsData[!is.na(xtsData$coun‌​t),"returns"]) – mgriebe Aug 17 '14 at 19:26
  • @mgriebe very cool! you are very close with the cumSum, don't know about your max, because i see .03 as the max of the entire series, not just the count series. unless this is a round error, the output for the max of the second count should be .27. i also see .03 being displayed, even when counts are not being displayed. please see my update number 2! and thank you very much! – jonnie Aug 17 '14 at 20:02

2 Answers2

3

The code shown is not consistent with the output in the image and there is no explanation provided so its not clear what manipulations were wanted; however, the question did mention that the main problem is distinguishing the groups so we will address that.

To do that we compute a new column g whose rows contain 1 for the first group, 2 for the second and so on. We also remove the NA rows since the g column is sufficient to distinguish groups.

The following code computes a vector the same length as count by first setting each NA position to FALSE and each non-NA position to TRUE. It then differences each position of that vector with the prior position. To do that it implicitly converts FALSE to 0 and TRUE to 1 and then performs the differencing. Next we convert this last result to a logical vector which is TRUE for each 1 component and FALSE otherwise. Since the first component of the vector that is differenced has no prior position we prepend 0 for that. The prepending operation implicitly converts the TRUE and FALSE values just generated to 1 and 0 respectively. Taking the cumsum fills in the first group with 1, the second with 2 and so on. Finally omit the NA rows:

x <- na.omit(transform(x, g =  cumsum(c(0, diff(!is.na(count)) == 1))))

giving:

> x
           returns count maxCount sumCount g
2010-11-26  -0.009     1    0.030    0.000 1
2010-12-03   0.030     1    0.030    0.030 1
2010-12-10   0.013     2    0.030    0.042 1
2010-12-17   0.003     2    0.030    0.045 1
2010-12-24   0.010     3    0.030    0.056 1
2010-12-31   0.001     4    0.030    0.056 1
2011-01-07   0.011     5    0.030    0.067 1
2011-01-14   0.017     6    0.030    0.084 1
2011-01-21  -0.008     7    0.030    0.077 1
2011-01-28  -0.005     7    0.030    0.071 1
2011-02-04   0.027     7    0.030    0.098 1
2011-02-11   0.014     7    0.030    0.112 1
2011-02-18   0.010     7    0.030    0.123 1
2011-03-18   0.027     1    0.027    0.000 2
2011-03-25  -0.019     2    0.027   -0.019 2
attr(,"na.action")
2010-11-18 2010-11-19 2011-02-25 2011-03-04 2011-03-11 2011-03-26 2011-03-27 
         1          2         16         17         18         21         22 
attr(,"class")
[1] "omit"

You can now use ave to perform any calculations you like. For example to take cumulative sums of returns by group:

transform(x, cumsumRet = ave(returns, g, FUN = cumsum))

Replace cumsum with any other function that is suitable for use with ave.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • this output is very close, please see my second update, and thanks for showing me the data.table package! – jonnie Aug 17 '14 at 20:21
  • This does not use the data.table package. Also no explanation has been provided still on how to compute the values. IF you want to `cumsum` the returns of each group except set the first value is to be set to 0 then replace `cumsum` with `function(x) replace(cumsum(x), 1, 0)` in `ave` line. – G. Grothendieck Aug 17 '14 at 20:25
  • yeah, sorry, wrong answer i replied to. and sorry if the calculations weren't clear. believe it or not, we managed to solve the problem! the magic was in the first line of code that you provided above. i'll clarify in the 3rd update. thank you very much!! – jonnie Aug 17 '14 at 20:50
  • 1
    The code at the end of the answer is almost certain wrong. It creates `x1` and `x2` both of which have `g` columns which are subsequently ignored in the `merge`. The main reason no one can help you is that even after requesting an explanation multiple times you have not provided it. – G. Grothendieck Aug 17 '14 at 21:11
  • wow. my mistake. you got me on the right track, i still need to understand better what is going on, but i think have an actual solution.. not just ignore variables... – jonnie Aug 17 '14 at 22:47
1

Ah, so "count" are the groups and you want the cumsum per group and the max per group. I think in data.table, so here is how I would do it.

library(xts)
library(data.table)

date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
                  "2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
                  "2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
                  "2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
                  "2011-03-26", "2011-03-27"))

returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
             -0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
             0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
              0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
              0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)

DT<-data.table(date,returns,count)]
DT[!is.na(count),max:=max(returns),by=count]
DT[!is.na(count),cumSum:= cumsum(returns),by=count]

#if you need an xts object at the end, then.

xtsData <- xts(cbind(DT$returns,DT$count, DT$max,DT$cumSum),DT$date)
mgriebe
  • 908
  • 5
  • 8