Project training data onto PCA R

Question

I'm a total beginner in ML, R, you name it, and I'm using FactoMineR's PCA function on my training set to find the principal components of my data.

res_pca <- PCA(training, scale.unit=TRUE, graph=FALSE)

Now I have to project my training data onto the space spanned by the vectors found by PCA. How do I do that?

This is what I've tried:

# get_pca_var is from the factoextra package
var <- get_pca_var(res_pca)
# training_no_target is just the training dataset without the target variable, which is a factor 
train_pca <- as.matrix(training_no_target) %*% var$coord
train_pca <- data.frame(train_pca)

I think train_pca should now contain the final dataset on which to train my models... is this right?

Dan Adams · Answer 1 · 2022-02-12T14:38:33.667

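PCA is an unsupervised method that is typically used for descriptive rather than predictive modeling, so we don't usually think of projecting data with PCA as training per se. However, it is possible to define a PCA space with one dataset and ask where new data falls in that same space. You were on the right track multiplying the data matrix by the PCA result with %*%. Strictly speaking, though, the rotation matrix is the set of eigenvectors (p$svd$V in FactoMineR); p$var$coord holds those eigenvectors scaled by the square roots of the eigenvalues, so multiplying by it stretches each axis.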

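Note that you have to apply the same centering and scaling to your new data that was applied to the original data, i.e. the means and standard deviations of the data used to build the PCA, not those of the new data. This is also discussed here.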
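To make the scaling point concrete, here is a minimal base R sketch. The toy matrices x_train and x_new and the helper names mu and sigma are hypothetical, made up only for illustration; the point is just that scale() with scale = TRUE recomputes the spread from whatever data it is given, instead of reusing the statistics of the original data.

```r
# hypothetical toy data: 'x_train' defines the scaling, 'x_new' is projected later
set.seed(42)
x_train <- matrix(rnorm(200, mean = 10, sd = 3), ncol = 2)
x_new   <- matrix(rnorm(20,  mean = 12, sd = 1), ncol = 2)

# centering and scaling statistics taken from the training data only
mu    <- colMeans(x_train)
sigma <- apply(x_train, 2, sd)

# intended: reuse the training statistics on the new data
new_ok <- scale(x_new, center = mu, scale = sigma)

# not intended: scale = TRUE derives the scaling from x_new itself
new_bad <- scale(x_new, center = mu, scale = TRUE)

# the two results differ, so the new points would land in different places
same <- isTRUE(all.equal(new_ok, new_bad, check.attributes = FALSE))
same
```

With a FactoMineR PCA object, the corresponding training statistics are stored in p$call$centre and p$call$ecart.type.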

Here's an example with the iris dataset where we define a PCA projection with half the data and then project the other half into that PCA space by %*%ing by the rotation.

library(tidyverse)
library(FactoMineR)

# split data
set.seed(1)
splits <- sample(nrow(iris), nrow(iris)/2)
train <- iris[splits,]
test <- iris[-splits,]

# build PCA rotation based on 'training' data
p <- train[,-5] %>% PCA(graph = FALSE)

# projection of training data
p$ind$coord %>%
  as.data.frame() %>% 
  bind_cols(., Species = train[,5]) %>% 
  ggplot(aes(Dim.1, Dim.2)) +
  ggtitle("original points") +
  geom_point(aes(color = Species))

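# scale and project 'test' data into original PCA space
# center and scale with the training means and standard deviations
# (p$call$scale.unit is only TRUE/FALSE; the standard deviations are in p$call$ecart.type),
# then rotate by the eigenvectors p$svd$V (p$var$coord is V scaled by sqrt(eigenvalues))
test[,-5] %>% 
  scale(center = p$call$centre, scale = p$call$ecart.type) %*% p$svd$V %>% 
  as.data.frame() %>% 
  setNames(paste0("Dim.", seq_along(.))) %>% 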
  bind_cols(., Species = test[,5]) %>% 
  ggplot(aes(Dim.1, Dim.2)) +
  ggtitle("projection of new points into original space") +
  geom_point(aes(color = Species))

Created on 2022-02-12 by the reprex package (v2.0.1)
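As a cross-check, FactoMineR also has a predict() method for PCA objects that projects new rows into an existing PCA space. The sketch below assumes FactoMineR is installed and compares that method with a by-hand projection (training means in p$call$centre, training standard deviations in p$call$ecart.type, eigenvectors in p$svd$V). The names manual, pred and train_manual are made up for this illustration.

```r
library(FactoMineR)

# same split as in the example above
set.seed(1)
splits <- sample(nrow(iris), nrow(iris) / 2)
train  <- iris[splits, ]
test   <- iris[-splits, ]

p <- PCA(train[, -5], graph = FALSE)

# by hand: center/scale with the training statistics, rotate by the eigenvectors
manual <- scale(as.matrix(test[, -5]),
                center = p$call$centre,
                scale  = p$call$ecart.type) %*% p$svd$V

# built-in projection of the new rows
pred <- predict(p, newdata = test[, -5])$coord

# if both approaches agree, this is TRUE
all.equal(unname(manual), unname(pred), check.attributes = FALSE)

# the same recipe applied to the training rows should reproduce p$ind$coord,
# i.e. p$ind$coord holds the training data already projected onto the components
train_manual <- scale(as.matrix(train[, -5]),
                      center = p$call$centre,
                      scale  = p$call$ecart.type) %*% p$svd$V
all.equal(unname(train_manual), unname(p$ind$coord), check.attributes = FALSE)
```

In other words, p$ind$coord has one row per training observation (the projected data you would feed to a model), while p$var$coord has one row per original variable and describes how the variables relate to the components.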

So is ```p$ind$coord```, in your example, the dataset I should use to train my model? What's the difference between ```p$ind$coord``` and ```p$var$coord``` ? — IDK, Feb 12 '22 at 07:09
