0

My data is composed of 30 years of observable values for 17 samples. The test dataframe below minimizes the sample depth to three but retains the time series length. Currently, my code only returns the desired calculation for the first column when I anchor the year to 1991. Note: 1991 is the first possible calculation due to lags in the formula. I would like to be able to run the calculation iteratively for all years for each sample.

test <-tibble(Year=c(1988:2017),value=c(runif(30)),value.2=c(runif(30)),value.3=c(runif(30)))
index=1991
x <-test[test$Year ==index,"value"]
lag.3 <-test[test$Year ==index-3,"value"]
lag.2 <-test[test$Year ==index-2,"value"]
lag.1 <-test[test$Year ==index-1,"value"]
lead.3 <-test[test$Year ==index+3,"value"]
lead.2 <-test[test$Year ==index+2,"value"]
lead.1 <-test[test$Year ==index+1,"value"]
average_lag_3 =(lag.1 +lag.2 +lag.3)/3
average_lead_3 = (lead.1 +lead.2 +lead.3)/3
var.x <- ((((lag.3 +lag.2+lag.1)*(x-average_lag_3)^2)/2)+(((lead.3 +lead.2+lead.1)*(index-average_lead_3)^2)/2))/2
dscore <- average_lag_3 - average_lead_3/(sqrt(var.x)*sqrt(2/3))
  • What's the desired output? Do you only care about `dscore`, or do you want `var.x` too? Anything else? – Gregor Thomas Nov 12 '20 at 05:32
  • in the `var.x` computation, do you need to use `index - average_lag_3` or did you mean to say `x-average_lag_3`?? – Onyambu Nov 12 '20 at 08:04

2 Answers2

0

Assuming you only want the dscore back, I put your code in a function. The only changes I made where replacing "value" with column and test with data, to match the argument names I chose:

## you should choose a better name than `foo`...
foo <- function(column, data, index = 1991) {
  x <- data[data$Year == index, column]
  lag.3 <- data[data$Year == index - 3, column]
  lag.2 <- data[data$Year == index - 2, column]
  lag.1 <- data[data$Year == index - 1, column]
  lead.3 <- data[data$Year == index + 3, column]
  lead.2 <- data[data$Year == index + 2, column]
  lead.1 <- data[data$Year == index + 1, column]
  average_lag_3 = (lag.1 + lag.2 + lag.3) / 3
  average_lead_3 = (lead.1 + lead.2 + lead.3) / 3
  var.x <-
    ((((lag.3 + lag.2 + lag.1) * (x - average_lag_3) ^ 2) / 2) + 
       (((lead.3 + lead.2 + lead.1) * (index - average_lead_3) ^ 2) / 2)) / 2
  dscore <- average_lag_3 - average_lead_3 / (sqrt(var.x) * sqrt(2 / 3))
}

## We can then apply this function to all the column names except the first one:
sapply(names(test[-1]), foo, data = test)
# $value.value
# [1] 0.4992873
# 
# $value.2.value.2
# [1] 0.1238061
# 
# $value.3.value.3
# [1] 0.8298876

Similarly, we can loop over years and columns:

## Generate combinations
my_years = 1991:1994
my_cols = names(test[-1])
year_col = expand.grid(year = my_years, column = my_cols)
## Make a place for the results
year_col$dscore = NA

## Do the calculations
for(i in 1:nrow(year_col)) {
  year_col$dscore[i] = foo(year_col$column[i], index = year_col$year[i], data = test)  
}
year_col
#    year  column     dscore
# 1  1991   value  0.4992873
# 2  1992   value   0.389055
# 3  1993   value  0.4510184
# 4  1994   value  0.5745417
# 5  1991 value.2  0.1238061
# 6  1992 value.2 0.08960943
# 7  1993 value.2  0.2999356
# 8  1994 value.2  0.4479232
# 9  1991 value.3  0.8298876
# 10 1992 value.3  0.7607005
# 11 1993 value.3  0.6520277
# 12 1994 value.3  0.5498328
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Thanks, I will give this a try. Much appreciated! – GreenStudent Nov 12 '20 at 17:26
  • I am getting Error: Can't subset with `[` using an object of class factor, when running the last loop. – GreenStudent Nov 12 '20 at 18:00
  • Can you confirm that it runs on the tibble you provided in the question? Those are the results that I show. If that's the case, then my question is to you what is the difference between the (working) sample data and the (not working) real data? Look especially at the column classes - do you have a `factor` column that should be numeric? Or do you need to exclude more than just the first column from the loop? – Gregor Thomas Nov 12 '20 at 18:04
  • I use `names(test[-1])` to use all the column names except the first one. You might need to adjust that to run on only the appropriate columns. – Gregor Thomas Nov 12 '20 at 18:05
0

Using Base R, you could write a function that returns all the values:

lead_lag_mean <- function(x, n){
  
  lag <- base::prop.table(c(y <- numeric(n),0, y + 1) )
  lead <- base::rev(lag)
  
  lead_mean <- stats::filter(x, lead)
  lag_mean <- stats::filter(x, lag)
  
  var_x <- (lag_mean * (x - lag_mean)^2 + lead_mean * (x - lead_mean)^2)*n/(n-1)/2
  dscore <- (lag_mean - lead_mean) /sqrt(var_x * 2/n)
  data.frame(lead_mean = lead_mean, lag_mean = lag_mean, var_x = var_x, dscore = dscore)
}

lapply(test[-1], lead_lag_mean, n=3)

try this on your test and check whether the first row matches your results

Onyambu
  • 67,392
  • 3
  • 24
  • 53