0

I wish to calculate the correlation coefficients of 12 variables using a running window in R.

My data is stored in a zoo object with a %m.%d.%Y %H:%M:%S index, each of the 12 variables have 1343 observations. I don't know what window size I'm going to use but I can change it if needs.

@Joshua Ulrich has posted here how to calculate the rolling correlation using rollapplyr but this example has only two variables. With my limited R experience I am unsure as to how I can incorporate one of the apply family functions to run the correlations across all 12 variables.

Any pointers would be greatly appreciated

My data looks as follows:

> head(wideRawXTS)
                    DO0182U09A3 DO0182U09B3 DO0182U09C3 DO0182U21A1 DO0182U21A2 DO0182U21A3 DO0182U21B1 DO0182U21B2 DO0182U21B3
2017-01-20 16:30:00     -101.50     -103.37     -103.86     -104.78     -104.95     -105.33     -102.50      -99.43     -104.05
2017-01-20 16:45:00     -101.32     -102.75     -104.22     -104.51     -103.94     -105.29     -102.82     -101.99     -103.94
2017-01-20 17:00:00     -101.45     -103.30     -103.93     -104.70     -104.82     -105.13     -103.72     -103.95     -104.25
2017-01-20 17:15:00     -100.91      -95.92      -99.22     -103.83     -104.72     -105.19     -103.57     -101.36     -104.09
2017-01-20 17:30:00     -100.91     -103.04     -104.09     -102.15     -104.91     -105.18     -103.88     -104.09     -103.96
2017-01-20 17:45:00     -100.97     -103.67     -104.12     -105.07     -104.23      -97.48     -103.92     -103.89     -104.01
                    DO0182U21C1 DO0182U21C2 DO0182U21C3
2017-01-20 16:30:00     -104.51     -104.42     -105.17
2017-01-20 16:45:00     -104.74     -104.65     -105.25
2017-01-20 17:00:00     -105.02     -105.04     -105.32
2017-01-20 17:15:00     -103.90     -102.95     -105.16
2017-01-20 17:30:00     -104.75     -105.07     -105.23
2017-01-20 17:45:00     -105.08     -105.14     -104.89

Cor Output of wideRawDFscaled

enter image description here

#Importing Data
rawDF <- read.csv("RTWP_DO0182_14Day_Raw.csv",
                  header = F,
                  col.names = c("Period Start Time",
                                "PLMNname",
                                "RNCname",
                                "WBTSname",
                                "WBTSID",
                                "WCELname",
                                "WCELID",
                                "Overload",
                                "AverageRTWP",
                                "DC_RTWP_High_PRX",
                                "RTWP_Threshold"),
                  colClasses = c("character",
                                 "NULL", #drops PLMN name
                                 "NULL", #drops RNC name
                                 "character", 
                                 "integer",
                                 "character",
                                 "integer",
                                 "NULL", #drops Overload
                                 "numeric",
                                 "NULL", #drops DC_RTWP_High_PRX
                                 "NULL"), #drops RTWP_Threshold
                  strip.white=TRUE,
                  na.strings = c("NA", 
                                 "NULL"),
                  as.is = T,
                  skip=2)

#convert period.start.time to POSIXlt
rawDF$Period.Start.Time <- as.POSIXlt(strptime(rawDF$Period.Start.Time, 
                                               format="%m.%d.%Y %H:%M:%S"))

#dcast the long data frame to a wide data frame
wideRawDF <- dcast(rawDF, 
                   Period.Start.Time ~ WCELname,
                   value.var = "AverageRTWP")

#assign the date times as rownames for converting to XTS
rownames(wideRawDF) = wideRawDF[[1]]

#drops the duplicate Period Start Time column since date times are rownames
wideRawDF$Period.Start.Time <- NULL

#Convert wideRawDF to XTS time series
wideRawDF <- as.xts(wideRawDF)

#NA values replaced by interpolated values
wideRawDF <- na.approx(wideRawDF, na.rm = FALSE)

#DF is centered by subtracting the column means and scaled by dividing the
#centered columns by their standard deviations
wideRawDFscaled <- scale(wideRawDF, center = TRUE, scale = TRUE)

#window <- 20
#cors <- combn(names(wideRawDFscaled),2,
#              function(p) rollapply(wideRawDFscaled, window ,
#                                    function(x) cor(x[,p[1]],x[,p[2]]),
#                                    by.column = FALSE))
#colnames(cors) <- combn(names(wideRawDFscaled),2,paste,collapse=".")

#Running window of correlation coefficients
Cor <- function(x) {
  corr <- cor(wideRawDFscaled)
  out <- as.data.frame.table(corr)[lower.tri(corr), ]
  with(out, setNames(Freq, paste(Var1, Var2)))
}
slidingCor <- rollapplyr(wideRawDFscaled, 6, Cor, by.column = FALSE)
Community
  • 1
  • 1
TheGoat
  • 2,587
  • 3
  • 25
  • 58

2 Answers2

1

You can use combn to apply the correlation to each pair of columns. Adapting the answer you referenced:

window <- 20
cors <- combn(colnames(wideRawXTS),2,
              function(p) rollapply(wideRawXTS, window  ,
                                    function(x) cor(x[,p[1]],x[,p[2]]), 
                                    by.column=FALSE))
colnames(cors) <- combn(colnames(wideRawXTS),2, paste, collapse=".")
HubertL
  • 19,246
  • 3
  • 32
  • 51
  • thanks for taking the time to reply. I have an error that I can't figure out, `> colnames(cors) <- combn(names(wideRawDFscaled),2,paste,collapse=".") Error in `colnames<-`(`*tmp*`, value = c("DO0182U09A3.DO0182U09B3", "DO0182U09A3.DO0182U09C3", : length of 'dimnames' [2] not equal to array extent` – TheGoat Mar 09 '17 at 22:49
  • Try replacing `names`with `colnames` – HubertL Mar 09 '17 at 22:53
  • please dput your data - I can't reproduce the error – HubertL Mar 09 '17 at 23:53
1

Assuming the 3 column input shown reproducibly in the Note and using a window of width 3 for illustration define the correlation function Cor that accepts a matrix and computes its correlation matrix extracting the lower triangular part to eliminate redundancy and adding column names. Now use that with rollapplyr:

library(xts)

Cor <- function(x) {
    corr <- cor(x)
    out <- as.data.frame.table(corr)[lower.tri(corr), ]
    with(out, setNames(Freq, paste(Var1, Var2)))
}
rollapplyr(xx, 3, Cor, by.column = FALSE)

giving:

                    DO0182U09B3.DO0182U09A3 DO0182U09C3.DO0182U09A3 DO0182U09C3.DO0182U09B3
2017-01-20 16:30:00                      NA                      NA                      NA
2017-01-20 16:45:00                      NA                      NA                      NA
2017-01-20 17:00:00                 0.98573                -0.99613                -0.99671
2017-01-20 17:15:00                 0.98629                 0.95983                 0.99297
2017-01-20 17:30:00                 0.52664                 0.47475                 0.99820
2017-01-20 17:45:00                 0.56204                 0.50460                 0.99769

Note: Input xx in reproducible form is:

xx <- structure(c(-101.5, -101.32, -101.45, -100.91, -100.91, -100.97, 
-103.37, -102.75, -103.3, -95.92, -103.04, -103.67, -103.86, 
-104.22, -103.93, -99.22, -104.09, -104.12), .Dim = c(6L, 3L), .Dimnames = list(
    NULL, c("DO0182U09A3", "DO0182U09B3", "DO0182U09C3")), index = structure(c(1484947800, 
1484948700, 1484949600, 1484950500, 1484951400, 1484952300), tzone = "", tclass = c("POSIXct", 
"POSIXt")), class = c("xts", "zoo"), .indexCLASS = c("POSIXct", 
"POSIXt"), tclass = c("POSIXct", "POSIXt"), .indexTZ = "", tzone = "")
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • @G, Grothendieck, thanks for your reply. I have since scaled my `wideRawXTS` dataframe using `scale()` and is now called `wideRawDFScaled`, When I run `cor()` on `wideRawDFScaled` which is an XTS object and has the dimensions [1:1343, 1:12] I see that DO0182U09B3.DO0182U09A3 has a correlation of 0.2759166. When I run your code above I get the same figure for the same two variables but it's repeated for all the time series. I have attached images to the OP showing what I am referring to. Also should the array return not be 12x12x1343? If I have 12 vars should I not have 12x12 rows and cols? – TheGoat Mar 09 '17 at 22:11
  • 1
    If you have n variables then there are choose(n, 2) = n * (n-1) / 2 values below the diagonal of the correlation matrix (or above). Please simplify your problem into a minimal one which exhibits the same issue and is fully reproducible. – G. Grothendieck Mar 09 '17 at 23:49
  • I tried your code again and I replaced `x` as per your example with my zoo object `wideRawDFscaled` and I believe it's working now. With my limited programming experience the definition of functions confuses me, I've seen so many people define `function(x)`, I'm guessing my function definition should be `Cor <- function(wideRawDFscaled) { corr <- cor(wideRawDFscaled)` – TheGoat Mar 10 '17 at 00:12
  • 1
    You can use any variable names you like in the arguments of your functions as long as you use the same name in the body of the function. They have no relation to variables of the same name outside the function. For example, changing `function(x) x` to `function(y) y` does nothing. They both produce the same output for the same input. – G. Grothendieck Mar 10 '17 at 00:18
  • 1
    I have changed the input name to `xx` to clearly distinguish it from the formal argument `x` in `Cor`. This does not change anything but is only for clarity in light of your comment. – G. Grothendieck Mar 10 '17 at 13:27