For my thesis I am investigating whether 5-fold cross-validation can be used to find the optimal number of principal components in PCR for time series data. I am using a 3-factor model.

However, when I run the PCR code I get an error saying that the variable lengths differ. I understand neither why this is a problem nor how to solve it.

I simulate data and then use 5-fold CV with PCR to find the number of components with the smallest MSE, but the call to the pcr function in R fails. Here are the most important parts of my code:

library(pls)      # pcr()
library(rugarch)  # ugarchspec(), ugarchpath()
library(caret)    # trainControl()

# Set simulation parameters
n <- 100  # Number of observations
k <- 3    # Number of factors (also do for 8 factors later)
p <- 120 # Number of predictors
num_sim <- 5000  # Number of simulations
AR_coef <- 0.7 
error_var <- 1 
for (sim in 1:num_sim) { 
  # Simulate factors using AR(1) process
  arima_spec <- list(order = c(1, 0, 0), ar = AR_coef)
  
  # Simulate the factor model with AR(1) process using arima.sim
  factors <- matrix(arima.sim(model = arima_spec, n = n*k, innov = rnorm(n*k, sd = sqrt(error_var))), nrow = n, ncol = k)
  
  omega <- matrix(rnorm(n * p), nrow = n, ncol = p)
  
  # Generate factor loadings (beta) for predictors
  loadings <- matrix(rnorm(p * k), nrow = p, ncol = k)
  
  # Generate predictor variables
  X <- factors %*% t(loadings) + omega 
  
  # Standardise X column by column
  # (X - X_mean would recycle the means down the rows instead of across columns)
  X <- scale(X)
  
  ## Generate epsilon following GARCH(1,1) process
  sim.spec    <- ugarchspec(variance.model     = list(model = "sGARCH", garchOrder = c(1,1)), 
                            mean.model         = list(armaOrder = c(0,0), include.mean = FALSE),
                            distribution.model = "norm", 
                            fixed.pars         = list(omega = 0.1, alpha1 = 0.2, beta1 = 0.7))
  path.sgarch <- ugarchpath(sim.spec, n.sim = n, n.start = 1)
  epsilon     <- as.vector(fitted(path.sgarch))
  
  # Generate response variable y
  theta <- matrix(rnorm(k), nrow = k, ncol = 1) 
  
  y <- factors %*% theta + epsilon
  
  data_set <- data.frame(y,X)

  for (ncomp in 1:max_components) {
    #Randomly shuffle data
    data <- data_set[sample(nrow(data_set)),]
    
    #Create 5 equally sized folds
    folds <- cut(seq(1,nrow(data)),breaks = 5,labels = FALSE)
    
    #Perform 5 fold cross validation
    for(i in 1:5){
      #Segment your data by fold using the which() function 
      testIndexes <- which(folds==i, arr.ind=TRUE)
      testData <- data[testIndexes, ]
      trainData <- data[-testIndexes, ]
      
      ctrl <- trainControl(method = "cv", number = 5)
      
      #Generate Model
      pcr_fit <- pcr(unlist(X) ~ y, data = trainData, ncomp = ncomp)
    }
  }
}

I get the following error:

Error in model.frame.default(formula = unlist(X) ~ y, data = trainData : variable lengths differ (found for 'y')

and I do not understand why. Could someone help?
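A likely explanation, offered as a sketch rather than a confirmed diagnosis: trainData has no column named X (data.frame(y, X) names the predictor columns X1, X2, ...), so the formula picks up the full matrix X from the workspace with all n rows, while y is taken from trainData with only 80% of the rows. The formula is also reversed, since the response belongs on the left-hand side. The small example below uses made-up dimensions and a simplified data-generating step (both are assumptions, not the simulation above) to show a call of the form y ~ . and a manual test-fold MSE.

```r
library(pls)  # provides pcr()

set.seed(1)
n <- 100; p <- 10                      # illustrative sizes only
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- X[, 1] + rnorm(n)
data_set <- data.frame(y, X)           # columns: y, X1, ..., X10

test_idx  <- 1:20                      # one hypothetical fold
trainData <- data_set[-test_idx, ]
testData  <- data_set[test_idx, ]

# y ~ . regresses y on every other column of trainData, so all
# variables come from the same 80-row data frame
pcr_fit <- pcr(y ~ ., data = trainData, ncomp = 3)

# predict() returns an (obs x responses x ncomp) array; drop() flattens it
y_pred <- drop(predict(pcr_fit, newdata = testData, ncomp = 3))
mse <- mean((testData$y - y_pred)^2)   # manual MSE on the held-out fold
```

In the loop above, the corresponding change would be pcr(y ~ ., data = trainData, ncomp = ncomp).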

Is there also an easy way to obtain the MSE from the PCR function or is it best to compute y_predict with the predict function in R and calculate it manually?
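One possible route, sketched under the assumption that the pls package is the source of pcr: the function can run the cross-validation itself via validation = "CV", and MSEP() then reports the cross-validated MSE for each number of components. The data below are illustrative stand-ins, and segment.type = "consecutive" is only a suggestion for keeping time-ordered blocks together, not a requirement.

```r
library(pls)  # provides pcr() and MSEP()

set.seed(1)
n <- 100; p <- 10                      # illustrative sizes only
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- X[, 1] + rnorm(n)
data_set <- data.frame(y, X)

# Fit up to 8 components and let pcr() do 5-fold CV internally
cv_fit <- pcr(y ~ ., data = data_set, ncomp = 8, scale = TRUE,
              validation = "CV", segments = 5,
              segment.type = "consecutive")

# Cross-validated MSE; the first entry is the intercept-only model
cv_msep <- MSEP(cv_fit, estimate = "CV")
mse_by_ncomp <- cv_msep$val[1, 1, ]

# Subtract 1 because position 1 corresponds to zero components
best_ncomp <- unname(which.min(mse_by_ncomp)) - 1
```

This replaces the hand-written fold loop with a single call per simulated data set; computing the MSE manually with predict() remains an option when custom folds are needed.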

Mieska