How to find correlation for every n rows in R?

Question

Following is my data frame lanec:

read.table(textConnection(scan(,character(),sep="\n")))
   vehicle.id frame.id svel PrecVehVel
1           2        1   55         59
2           2        2   55         59
3           2        3   53         57
4           2        4   50         54
5           2        5   48         52
6           3        3   49         53
7           3        4   55         59
8           3        5   55         59
9           3        6   43         47
10          3        7   45         49
11          3        8   52         56
12          3        9   50         54
13          4        1   38         42
14          4        2   42         46
15          4        3   45         49
16          4        4   48         52
17          4        5   50         54
18          4        6   52         56
19          4        7   55         59
20          5        6   49         53
21          5        7   52         56
22          5        8   54         58
23          5        9   58         62
24          5       10   60         64
25          5       11   63         67
26          5       12   70         74

<Carriage return>

I want to find correlation cor between svel and PrecVehVel (vehicle's velocity and preceding vehicle's velocity respectively) by vehicle.id for every 3 rows but for consecutive rows. This means that in the data frame lanec for vehicle.id==2, R should first find correlation between

   svel PrecVehVel
1    55         59
2    55         59
3    53         57

svel(55,55,53) and PrecVehVel(59,59,57), then start again from the second row and find correlation between

   svel PrecVehVel
2    55         59
3    53         57
4    50         54

svel(55,53,50) & PrecVehVel(59,57,54) and so on.

The output should be something like this:

vehicle.id     frames     speed.cor
2               1 - 3     1
2               2 - 4     1
2               3 - 5     1
2               4 - 5     1

Note that the last entry in frames column has only 2 frames for which the correlation was found because there was no more data for vehicle 2. The best I could do with my limited knowledge of R was following:

ddply(lanec, 'vehicle.id', summarize, speed.cor = cor(svel, PrecVehVel) )

But this clearly doesn't meet the goal because it finds the correlation for all the rows for a vehicle.id

Sort of. The original data frame is about vehicle trajectories (speed, distance, etc) at every 0.1 seconds (length of one frame). — umair durrani, May 15 '14 at 22:39
In my experience, functions like ``ddply``, ``ave``, ``apply`` really struggle when datasets get big. Does it seem to be working for your larger dataset? — Mike.Gahan, May 16 '14 at 12:46
@Mike.Gahan `ddply` works really good when I use `summarize` in it. Using `for loop` is faster in my experience than using `transform` in `ddply`.@Henrik's solution works for me but I want to use it to `summarize` so that I know which correlation value is for which interval of frames similar to output described in the question. If I use it as is it becomes almost impossible to determine the correlation value for a particular interval of frames. — umair durrani, May 16 '14 at 13:54

Henrik · Answer 1 · 2014-05-16T14:06:25.850

1

You may try a rolling correlation using rollapply from package zoo:

library(zoo)
ddply(lanec, 'vehicle.id', function(dat){
  dat$speed.cor = rollapply(data = dat, width = 3,
                            FUN = function(x) cor(x[ , "svel"], x[ , "PrecVehVel"]),
                            by.column = FALSE, fill = NA)
  dat
})

Note that the window width is fixed to 3. Thus, this alternative will not give you the last '2 frame' correlation.

Edit following comment from OP. This may be closer to your desired output. I keep the first and last frame in separate columns. Easy to paste together if you wish.

ddply(lanec, 'vehicle.id', function(dat){
  speed.cor <- rollapply(data = dat, width = 3,
                         FUN = function(x) cor(x[ , "svel"], x[ , "PrecVehVel"]),
                         by.column = FALSE)
  len <- length(speed.cor)
  vehicle.id <- head(dat$vehicle.id, len)
  first.frame.id <- head(dat$frame.id, len)
  last.frame.id <- tail(dat$frame.id, len)
  data.frame(vehicle.id, first.frame.id, last.frame.id, speed.cor)
})

#    vehicle.id first.frame.id last.frame.id speed.cor
# 1           2              1             3         1
# 2           2              2             4         1
# 3           2              3             5         1
# 4           3              3             5         1
# 5           3              4             6         1
# 6           3              5             7         1
# 7           3              6             8         1
# 8           3              7             9         1
# 9           4              1             3         1
# 10          4              2             4         1
# 11          4              3             5         1
# 12          4              4             6         1
# 13          4              5             7         1
# 14          5              6             8         1
# 15          5              7             9         1
# 16          5              8            10         1
# 17          5              9            11         1
# 18          5             10            12         1

edited May 16 '14 at 14:06

answered May 15 '14 at 21:55

Henrik

65,555
14
143
159

Losing the '2 frame' is probably not a big deal since the correlation between two points is always 1. – Neal Fultz May 15 '14 at 22:24
Thank you very much. There is one problem with original data set. When I use `width=11` or more in `rollapply` it gives me following error: `Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argument`. I actually need the `width=50` because each frame is 0.1 seconds long and using 50 frames I want to find correlation for 5 seconds velocity data – umair durrani May 16 '14 at 20:13
At least the [**first hit on google**](https://stat.ethz.ch/pipermail/r-sig-finance/2012q2/010000.html) for the error suggests that `width` is larger than number of registration for at least one vehicle.id – Henrik May 16 '14 at 20:21
That is true. I have also checked the minimum number of frames in original data frame which is 5. The code, however, worked with `width=10` – umair durrani May 16 '14 at 20:37
But `width = 11` gives `Error: wrong sign in 'by' argument` (_only_ that, perhaps you have an older `zoo`. I have zoo_1.7-11). – Henrik May 16 '14 at 20:39
I have same `zoo` version. This means that `rollapply` has a limitation on width i.e. it cannot be greater than 10. – umair durrani May 16 '14 at 20:48
You may try `TTR::runCor` (see e.g. [**here**](http://stackoverflow.com/questions/20690809/calculate-rolling-correlation-using-rollapply)) to see if you get a more informative error message. – Henrik May 16 '14 at 21:05
The help for `rollapply` tells something about `width` and `by` in `Detail`s section but that is completely alien language for me. Anyway, thanks. I think I should try Mike's code which I was trying to avoid because of data.table – umair durrani May 16 '14 at 21:06

Neal Fultz · Answer 2 · 2014-05-15T22:11:12.293

This makes a matrix of which rows should get used in each group:

> bandSparse(5,4,-2:0)
5 x 4 sparse Matrix of class "ngCMatrix"

[1,] | . . .
[2,] | | . .
[3,] | | | .
[4,] . | | |
[5,] . . | |

We can apply over it's columns for each vehicle:

> vehicle2 <- subset(lanec, vehicle.id == 2)
> apply(bandSparse(5,4,-2:0), 2, function(i) list(r= cor(vehicle2[i,3:4]) ) )

Then we can do it for each vehicle:

by(lanec vehicle$vehicle.id, function(x) apply(bandSparse(nrow(x),nrow(x)-1,-2:0), 2, function(i) list(r= cor(x[i,3:4]) ) ) )

Mike.Gahan · Accepted Answer · 2014-05-21T23:10:18.820

This is a tricky problem. I have found that most base R solutions for these "rolling calc" type of problems are way way too slow for data of any significant size. I have had lots of luck by using the data.table package (especially if speed is an issue). I included the parallel package just in case you have a ton of observations and you need to do this faster. It is set to mc.cores=1 right now, but if you are running Mac or Linux, you can obviously up it.

lanec <- structure(list(vehicle.id = c(2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L), frame.id = c(1L, 2L, 3L, 4L, 5L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 6L, 7L, 8L, 9L, 10L, 11L, 
12L), svel = c(55, 75, 53, 50, 32, 49, 55, 55, 43, 45, 52, 50, 
38, 42, 45, 48, 50, 52, 55, 49, 52, 54, 58, 60, 63, 70), PrecVehVel = c(59, 
59, 57, 54, 52, 53, 59, 59, 47, 49, 56, 54, 42, 46, 49, 52, 54, 
56, 59, 53, 56, 58, 62, 64, 67, 74)), .Names = c("vehicle.id", 
"frame.id", "svel", "PrecVehVel"), class = "data.frame", row.names = c(NA, 
-26L))

#Load data.table package
require("data.table")
require("parallel")
data <- data.table(lanec)

#What length of correlation vector do you want?
cor.vec <- 2

##Assign each customer an ID
data[,ID:=.GRP,by=c("vehicle.id")]

##Group values at the list level
Ref <- data[,list(frame.id=list(I(frame.id)),svel.list=list(I(svel)),PrecVehVel.list=list(I(PrecVehVel))),by=list(vehicle.id)]

#Calculate rolling calculation
data$Roll.Corr <- mcmapply(FUN = function(RD, NUM) {

  #mcmapply is a multicore version of mapply. If running Linux or Mac, you can up the amount of cores and have the code run faster

  #d is the difference between the current frame.id and all other frame.id's within the   same vehicle id.
  d <- (Ref$frame.id[[NUM]] - RD)

  #The following checks whether d is within the "window" you want. If not in the desired "window", then svel1 and prec1 will have zero values. If in desired "window", then its value will be the respective "svel" and "prec" value in original data.
  svel1 <- (d >= 0 & d <= cor.vec)*Ref$svel.list[[NUM]]
  prec1 <- (d >= 0 & d <= cor.vec)*Ref$PrecVehVel.list[[NUM]]

  #Following discards all data points not in sliding "window" (deletes all of the zeros)
  keep <- which(d >= 0 & d <= cor.vec)
  svel1 <- svel1[keep]
  prec1 <- prec1[keep]

  #Following makes sure a correlation value is only provided if the number of points within the window is larger than the correlation "window" length
  if (length(svel1)>cor.vec){
      cor(svel1,prec1) 
    } else {
      NA
    }   
}, RD = data$frame.id,NUM=data$ID,mc.cores=1)

#Print data
data[,frame.start:=ifelse(is.na(Roll.Corr),NA,frame.id)]
data[,frame.end:=ifelse(is.na(Roll.Corr),NA,frame.id+cor.vec)]
head(data,10)

    vehicle.id frame.id svel PrecVehVel ID Roll.Corr frame.start frame.end
 1:          2        1   55         59  1 0.5694948           1         3
 2:          2        2   75         59  1 0.8635894           2         4
 3:          2        3   53         57  1 0.8746393           3         5
 4:          2        4   50         54  1        NA          NA        NA
 5:          2        5   32         52  1        NA          NA        NA
 6:          3        3   49         53  2 1.0000000           3         5
 7:          3        4   55         59  2 1.0000000           4         6
 8:          3        5   55         59  2 1.0000000           5         7
 9:          3        6   43         47  2 1.0000000           6         8
10:          3        7   45         49  2 1.0000000           7         9

Thank you very much. Your code helped me achieving the results. Could you please add some comments in `Calculate Rolling Calculation` section to elaborate what's going on in each step? — umair durrani, May 21 '14 at 22:53
Just added comments. Could you tell me how fast it ran? (and on how big a dataset)? — Mike.Gahan, May 21 '14 at 23:13
A bundle of thanks for extra comments. It about 4 minutes. I am using Windows 7 with 32 GB RAM. The data set has more than 400,000 rows and 25 columns. — umair durrani, May 21 '14 at 23:20
sure thing! I was hoping it would be faster, but I guess that is ok considering you are only using 1 core. — Mike.Gahan, May 21 '14 at 23:30

How to find correlation for every n rows in R?

3 Answers3