Function to count consecutive non-zero values

Question

With a dataset that looks like this:

Index                  x      y

2012-07-24 07:00:00   0.1     0
2012-07-24 07:15:00   0.2     1
2012-07-24 07:30:00   0.3     0
2012-07-24 07:45:00   0.2     0
2012-07-24 08:00:00   0.3     1
2012-07-24 08:15:00   0.4     1
2012-07-24 08:30:00   0.4     1
2012-07-24 08:45:00   0.5     0
2012-07-24 09:00:00   0.6     0
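
For reproducibility, here is a minimal sketch that rebuilds the data above as a data frame and confirms the two events with rle. The name df1 is an assumption, chosen only to match the name used in the answer; the UTC time zone is also an assumption, since the original data does not state one.

```r
# Rebuild the example data. 'df1' and the UTC time zone are assumptions.
df1 <- data.frame(
  Index = as.POSIXct("2012-07-24 07:00:00", tz = "UTC") + 15 * 60 * (0:8),
  x = c(0.1, 0.2, 0.3, 0.2, 0.3, 0.4, 0.4, 0.5, 0.6),
  y = c(0, 1, 0, 0, 1, 1, 1, 0, 0)
)

# Run-length encode the "y is non-zero" indicator to locate the events.
runs <- rle(df1$y != 0)

# Lengths of the runs where y is non-zero, i.e. the event lengths.
event_lengths <- runs$lengths[runs$values]
event_lengths
```

With this data, event_lengths should contain one entry per event, in order of appearance.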

I currently have a for loop that goes through this large dataset and essentially calculates y/x each time y != 0. An additional calculation I am trying to make is the average x within an event, where an event is defined as any consecutive string in y of non-zero values. Using the dataset I have provided above, there are two events: one of length 1, and one of length 3.

For the second event of length 3, in my for loop, I would like to return three values, one for each step through the loop: (0.3), (0.3+0.4)/2, and (0.3+0.4+0.4)/3.

I am having trouble determining the most efficient way to do this. I have looked at some previous posts, which mainly deal with finding the indices of zero values using rle and patterns like which(x != 0).

Any help is greatly appreciated.

akrun · Answer 1 · 2015-10-01T17:05:09.223


Based on the calculation described in the OP's post, maybe this helps. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and group by the run-length id of whether 'y' is non-zero (rleid(y!=0)). Within each group, we divide the cumsum of 'x' where 'y' is not 0 by the corresponding positions in the run (seq_along(x)), and assign (:=) the result as a new column ('new1').

library(data.table)#v1.9.6+
setDT(df1)[,new1:=cumsum(x[y!=0])/seq_along(x)[y!=0] ,.(rleid(y!=0))]
df1
#                 Index   x y      new1
#1: 2012-07-24 07:00:00 0.1 0        NA
#2: 2012-07-24 07:15:00 0.2 1 0.2000000
#3: 2012-07-24 07:30:00 0.3 0        NA
#4: 2012-07-24 07:45:00 0.2 0        NA
#5: 2012-07-24 08:00:00 0.3 1 0.3000000
#6: 2012-07-24 08:15:00 0.4 1 0.3500000
#7: 2012-07-24 08:30:00 0.4 1 0.3666667
#8: 2012-07-24 08:45:00 0.5 0        NA
#9: 2012-07-24 09:00:00 0.6 0        NA
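
For comparison, the same grouping idea can be sketched in base R without data.table. This is not the code from the answer, only an illustration under the assumption that x and y are plain numeric vectors; the names nz, run_id and the reuse of new1 are hypothetical. Here cumsum over the change points plays the role that rleid plays above.

```r
x <- c(0.1, 0.2, 0.3, 0.2, 0.3, 0.4, 0.4, 0.5, 0.6)
y <- c(0, 1, 0, 0, 1, 1, 1, 0, 0)

# TRUE wherever y is part of an event.
nz <- y != 0

# Run id: increases by one each time 'nz' flips between TRUE and FALSE.
run_id <- cumsum(c(TRUE, diff(nz) != 0))

# Running mean of x within each run (cumulative sum over position in run).
new1 <- ave(x, run_id, FUN = function(v) cumsum(v) / seq_along(v))

# Keep the running mean only inside events; everything else becomes NA.
new1[!nz] <- NA
round(new1, 7)
```

The rounded result can be compared row by row against the new1 column printed above.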

EDIT: Or use cummean (from the dplyr package) instead

library(dplyr) # provides cummean
setDT(df1)[,new1:= cummean(x[y!=0]), .(rleid(y!=0))]

edited Oct 01 '15 at 17:05

answered Oct 01 '15 at 16:42

akrun


Thanks for your suggestion. One important thing I forgot to mention that affects your code suggestion is that y is not always 1, it can be any non-zero number. Would that simply just change all of the "y==1" to "y!=0"? – user2943039 Oct 01 '15 at 17:03
@user2943039 In that case change it to `y!=0` and it should work. – akrun Oct 01 '15 at 17:05
@user2943039 Updated the post. Earlier, I forgot to change `y!=0` in the `rleid`. Tested on an example data and it is working. – akrun Oct 01 '15 at 17:10
This code does not work for me for some reason. One issue is that my data is an xts, which appears to not be something that you can read through data.table. So, I tried making it "as.data.frame", but it still didn't produce any results. Also, I want this to be in a for loop, whereas this code doesn't have any index for the for loop in it. – user2943039 Oct 02 '15 at 15:59
I seem to have found a workaround, please ignore previous comment. Thanks again for your help! – user2943039 Oct 02 '15 at 16:06
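
Since the comments ask for a version that fits inside an existing for loop, here is a minimal base R sketch of that idea. It is not the workaround mentioned above, which was never posted. It assumes x and y have already been extracted as plain numeric vectors (for an xts object that extraction step is not shown), and the names ratio, run_mean, run_sum and run_len are hypothetical.

```r
x <- c(0.1, 0.2, 0.3, 0.2, 0.3, 0.4, 0.4, 0.5, 0.6)
y <- c(0, 1, 0, 0, 1, 1, 1, 0, 0)

n <- length(x)
ratio    <- rep(NA_real_, n)  # y/x, filled only where y != 0
run_mean <- rep(NA_real_, n)  # average x so far within the current event

run_sum <- 0  # sum of x over the current event
run_len <- 0  # number of steps in the current event

for (i in seq_len(n)) {
  if (y[i] != 0) {
    # Inside an event: extend the running totals and record both values.
    run_sum <- run_sum + x[i]
    run_len <- run_len + 1
    ratio[i] <- y[i] / x[i]
    run_mean[i] <- run_sum / run_len
  } else {
    # Outside an event: reset so the next event starts from scratch.
    run_sum <- 0
    run_len <- 0
  }
}

round(run_mean, 7)
```

Keeping a running sum and count avoids re-scanning earlier rows at each step, so the loop stays linear in the number of rows.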
