For my thesis I am investigating whether 5-fold cross-validation can be used to find the optimal number of principal components in principal component regression (PCR) for time series data. I am using a three-factor model.
I am simulating data and then using 5-fold CV with PCR to find the number of components with the smallest MSE. However, when I call the pcr function in R I get an error saying that the variable lengths differ. I don't understand why this is an issue, and especially not how to solve it.
Here are the most important parts of my code:
library(pls)     # pcr()
library(rugarch) # ugarchspec(), ugarchpath()
library(caret)   # trainControl()
# Set simulation parameters
n <- 100 # Number of observations
k <- 3 # Number of factors (also do for 8 factors later)
p <- 120 # Number of predictors
num_sim <- 5000 # Number of simulations
AR_coef <- 0.7
error_var <- 1
for (sim in 1:num_sim) {
# Simulate factors using AR(1) process
arima_spec <- list(order = c(1, 0, 0), ar = AR_coef)
# Simulate the factor model with AR(1) process using arima.sim
factors <- matrix(arima.sim(model = arima_spec, n = n*k, innov = rnorm(n*k, sd = sqrt(error_var))), nrow = n, ncol = k)
omega <- matrix(rnorm(n * p), nrow = n, ncol = p)
# Generate factor loadings (beta) for predictors
loadings <- matrix(rnorm(p * k), nrow = p, ncol = k)
# Generate predictor variables
X <- factors %*% t(loadings) + omega
# Standardise X
X_mean <- colMeans(X)
X_sd <- apply(X, 2, sd)
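# scale() centres and scales column by column; (X - X_mean)/X_sd would recycle the vectors down the rows instead
X <- scale(X, center = X_mean, scale = X_sd)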
## Generate epsilon following GARCH(1,1) process
sim.spec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1,1)),
mean.model = list(armaOrder = c(0,0), include.mean = FALSE),
distribution.model = "norm",
fixed.pars = list(omega = 0.1, alpha1 = 0.2, beta1 = 0.7))
path.sgarch <- ugarchpath(sim.spec, n.sim = n, n.start = 1)
epsilon <- as.vector(fitted(path.sgarch))
# Generate response variable y
theta <- matrix(rnorm(k), nrow = k, ncol = 1)
y <- factors %*% theta + epsilon
data_set <- data.frame(y,X)
for (ncomp in 1:max_components) {
#Randomly shuffle data
data <- data_set[sample(nrow(data_set)),]
#Create 5 equally sized folds
folds <- cut(seq(1,nrow(data)),breaks = 5,labels = FALSE)
#Perform 5 fold cross validation
for(i in 1:5){
#Segment your data by fold using the which() function
testIndexes <- which(folds==i, arr.ind=TRUE)
testData <- data[testIndexes, ]
trainData <- data[-testIndexes, ]
ctrl <- trainControl(method = "cv", number = 5)
#Generate Model
pcr_fit <- pcr(unlist(X) ~ y, data = trainData, ncomp = ncomp)
}
}
} # end of simulation loop
I get the following error:
Error in model.frame.default(formula = unlist(X) ~ y, data = trainData : variable lengths differ (found for 'y')
and I do not understand why. Could someone help?
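For reference, here is a minimal sketch of what seems to trigger this message, on a small hypothetical data set (the sizes, the 8-row training split, and the tryCatch wrapper are illustrative assumptions, not part of my thesis code). It only uses base R's model.frame, which is the function named in the error. The apparent cause is that trainData has columns y, X1, X2, ... but no column called X, so X in the formula is looked up outside trainData and still has all n rows, while y is taken from the shorter trainData.

```r
# Hypothetical toy data: 10 observations, 4 predictors
set.seed(1)
n <- 10
p <- 4
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rnorm(n)
data_set <- data.frame(y, X)   # columns are named y, X1, ..., X4
trainData <- data_set[1:8, ]   # 8 training rows, standing in for one CV split

nrow(X)          # 10: the full matrix from the global environment
nrow(trainData)  # 8:  the rows actually available for fitting

# Same formula shape as in the question; capture the error message instead of stopping
res <- tryCatch(
  model.frame(unlist(X) ~ y, data = trainData),
  error = function(e) conditionMessage(e)
)
print(res)

# With y as the response and "." for all other columns,
# every variable is taken from trainData and the lengths agree
mf <- model.frame(y ~ ., data = trainData)
dim(mf)
```

If that diagnosis is right, the corresponding call in the loop would be pcr(y ~ ., data = trainData, ncomp = ncomp), with y as the response and the predictor columns on the right-hand side.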
Is there also an easy way to obtain the MSE from the pcr function, or is it better to compute predictions with predict() and calculate the MSE manually?
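To make the two options concrete, here is a hedged sketch on hypothetical toy data (the data-generating line, the 80/20 split, and the cap of 10 components are assumptions for illustration). It assumes pcr comes from the pls package. The first part lets pcr run its own 5-segment cross-validation and reads the CV estimate with MSEP(); the second part computes a test-set MSE by hand with predict().

```r
library(pls)

# Hypothetical toy data, not the simulation design from the question
set.seed(1)
n <- 100
p <- 20
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- X[, 1] - X[, 2] + rnorm(n)
data_set <- data.frame(y, X)
max_components <- 10   # assumed upper bound for this sketch

# Option 1: built-in cross-validation with 5 segments
fit_cv <- pcr(y ~ ., data = data_set, ncomp = max_components,
              validation = "CV", segments = 5)
# $val is indexed as (estimate, response, model); the first model is intercept-only
cv_mse <- MSEP(fit_cv, estimate = "CV")$val[1, 1, ]
best_ncomp <- unname(which.min(cv_mse[-1]))  # drop the 0-component entry

# Option 2: manual MSE on a held-out block using predict()
trainData <- data_set[1:80, ]
testData <- data_set[81:100, ]
fit_train <- pcr(y ~ ., data = trainData, ncomp = max_components)
y_predict <- drop(predict(fit_train, newdata = testData, ncomp = best_ncomp))
manual_mse <- mean((testData$y - y_predict)^2)

print(best_ncomp)
print(manual_mse)
```

Since the data are a time series, the default random segments may not be appropriate; pcr also accepts segment.type = "consecutive", which keeps each fold as a contiguous block.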