I have to perform a PCA on a high-dimensional dataset with the infrared spectra of different wines and then plot it in 2D. I have to color the red wines in red and the white wines in turquoise on the plot.
This is the code I came up with:
wine_pca <- prcomp(data[,-c(1:9)]) #eliminate columns 1-9 which contain other non-numeric information
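pc <- as.data.frame(predict(wine_pca)) # PCA scores as a data frame, which ggplot() expects
pc1 <- pc[, 1] # scores on the first principal component
pc2 <- pc[, 2] # scores on the second principal component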
#plot principal components pc1 & pc2
ggplot(pc, aes(PC1, PC2)) + theme_bw() +
geom_point(aes(shape = data$name, color = data$color), show.legend = TRUE, size = 3) +
scale_shape_manual(values = c(3, 4, 8, 21, 22, 23, 24, 25)) +
scale_color_manual(guide = "none", values = c("red", "turquoise")) + # guide = FALSE is deprecated in recent ggplot2
theme(legend.position = 'right', legend.title = element_blank()) +
xlab("First Principal Component") +
ylab("Second Principal Component") +
ggtitle("First Two Principal Components of a Selection of Wines")
I thought it looked fine and ran without problems, but the feedback I got from my professor was:
"Why did you rescale the data for pca? This does not make sense in this case (otherwise please explain) and leads to different results"
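For reference, here is a minimal sketch of how the centering and scaling applied by prcomp() could be inspected. It uses the built-in iris measurements as a stand-in for my spectra, so the names spectra, pca_default and pca_scaled are illustrative and not part of my actual code:

```r
# Stand-in data: four numeric columns from iris (NOT the wine spectra)
spectra <- iris[, 1:4]

# prcomp() defaults are center = TRUE and scale. = FALSE
pca_default <- prcomp(spectra)
# For comparison: explicitly ask for unit-variance scaling
pca_scaled <- prcomp(spectra, scale. = TRUE)

# The fitted object records what was applied to the data
print(pca_default$center)  # column means subtracted before the rotation
print(pca_default$scale)   # FALSE when no rescaling was done
print(pca_scaled$scale)    # per-column standard deviations when scale. = TRUE

# The scores live in $x; predict() without newdata returns the same matrix
scores <- as.data.frame(pca_default$x)
print(head(scores[, c("PC1", "PC2")]))
```

If `$scale` is FALSE for the wine PCA, the variables were only mean-centered and not divided by their standard deviations, at least as far as prcomp() itself is concerned.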
As I am a doofus, I don't really understand the feedback - where did I scale the data? Is my approach fundamentally wrong? I would be mighty grateful if one of you whiz kids could help a pretty hopeless girl out. Thanks!