
So, I am analyzing a dataset of 160 observations and 20 variables and performing a PCA. It concerns patients affected by a disease; the variables are antibody levels measured in the same experiment, all in the same units (U/mL). Since these variables only take positive values, I cannot understand how I would get samples on the positive PC1 side of the plot when no variable contributes to that side (given that there are no negative values involved in these variables).

As for confounding factors, I have the patients' age, gender, and duration of infection, but these three were not included in the PC analysis.

I am having trouble understanding the following: when using the R package factoextra's function fviz_pca_biplot() to see both the sample distribution and each variable's contribution to PCs 1 and 2, I noticed that all 20 of my variables have strongly negative coordinates on PC1.

The following images were generated from a small sample of my original data; even though the variables' contributions are not identical to those of the full data, they are still strongly negative on PC1. This is understandable if I do not center the data in prcomp() (image 1): all of my samples fall on the negative side of PC1, and that component explains most of the data's inertia.

library(factoextra)

# Small sample of the original data: samples in rows, antibody variables in columns
PCAf <- read.table("PCA_small_sample.csv", sep = ";", header = TRUE, row.names = 1)

# Scaled but NOT centered PCA
res.pca <- prcomp(PCAf, scale = TRUE, center = FALSE)

fviz_pca_biplot(res.pca)

Not centered PCA
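For reference, the explained variance can be inspected like this (summary() is base R; fviz_eig() is factoextra's scree plot, if I am using it correctly):

# Proportion of variance explained by each component
summary(res.pca)

# Scree plot of the explained variance (factoextra)
fviz_eig(res.pca, addlabels = TRUE)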

However, I have been taught that it is necessary to center the data when performing PCA, and then the plot looks like this:

# Scaled and centered PCA (prcomp centers by default)
res.pca <- prcomp(PCAf, scale = TRUE)

fviz_pca_biplot(res.pca)

Centered PCA

This reduces the variance explained by PC1 and increases that of PC2, but even though it changes the variables' coordinates, none of them has a positive coordinate on PC1.

# Variable coordinates on the principal components
res.var <- get_pca_var(res.pca)
res.var$coord

These are the values for the non-centered PCA: [image: non-centered coordinates]. And for the centered PCA: [image: centered coordinates].
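If I understand correctly how factoextra derives these coordinates, they should simply be the prcomp loadings scaled by each component's standard deviation, so the same all-negative PC1 pattern can be reproduced in base R (this is my assumption about the package's internals):

# My assumption: variable coordinates = loadings (rotation) * component standard deviation
coord_manual <- sweep(res.pca$rotation, 2, res.pca$sdev, "*")
coord_manual[, 1:2]   # compare with res.var$coord for Dim.1 and Dim.2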

Am I doing something wrong? Should I really present my analysis with the second image even though the vectors do not seem to match what we are seeing?

My main question is: when presenting the PCA, it is better to do so with the centered data, right? If so, should I apply some sort of correction to the variables' coordinates/contributions to the PCs? This second image does not seem reliable to me, though that may just be lack of experience. Since all the variable vectors point toward the left side of the plot, what is pulling some of the samples (e.g. 7, 10, 8, 4, 20) toward the right side of the plot (positive PC1)? It seems counterintuitive that there is not even a single vector on the right side.
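To make the arithmetic that puzzles me concrete: as far as I understand, prcomp() computes the sample scores from the centered (and here scaled) data rather than from the raw positive values, so a sample that sits below the column means can end up with a positive PC1 score even when all the loadings are negative. A minimal sketch of that check (my understanding, so it may be wrong):

# Scores come from the centered/scaled data, not the raw (all-positive) values
Z <- scale(PCAf, center = TRUE, scale = TRUE)   # what prcomp(scale = TRUE) actually decomposes
scores_manual <- Z %*% res.pca$rotation         # should reproduce res.pca$x
all.equal(unname(scores_manual), unname(res.pca$x))

# Samples with positive PC1 scores should be the ones sitting below average
# on the (negatively loading) variables
rowMeans(Z)[order(res.pca$x[, 1], decreasing = TRUE)]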

This also raises the question: should I include confounding factors when performing a PCA? I accounted for them with linear regression but did not include them in the PC analysis.
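Just to show what I mean by accounting for them with regression, a rough sketch of the residual approach I had in mind is below. The confounder table conf, its columns (age, gender, duration) and res.pca_adj are made-up names, and the values are simulated only so the snippet runs; this is not necessarily the right way to handle it:

# Illustrative confounder table (names and values are made up for this sketch)
conf <- data.frame(
  age      = rnorm(nrow(PCAf), mean = 45, sd = 10),
  gender   = factor(sample(c("F", "M"), nrow(PCAf), replace = TRUE)),
  duration = rnorm(nrow(PCAf), mean = 12, sd = 4)
)

# Regress each antibody on the confounders, keep the residuals,
# then run the PCA on the adjusted values
adjusted <- apply(PCAf, 2, function(y) residuals(lm(y ~ age + gender + duration, data = conf)))

res.pca_adj <- prcomp(adjusted, scale = TRUE, center = TRUE)
fviz_pca_biplot(res.pca_adj)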

Anyway, thank you all so much in advance.

PS: I uploaded a file containing a sample of my data, the code, and the images to GitHub.

PS2: When plotting a generic dataset, I do not see the same issue. Without centering the same thing happens at first, but after centering the data there are vectors in all four quadrants, and I can extract some rationale from them.

# Simulate a generic dataset: 100 genes x 10 samples (5 wild type, 5 knockout)
data.matrix <- matrix(nrow = 100, ncol = 10)
colnames(data.matrix) <- c(
  paste("wt", 1:5, sep = ""),
  paste("ko", 1:5, sep = ""))
rownames(data.matrix) <- paste("gene", 1:100, sep = "")
for (i in 1:100) {
  wt.values <- rpois(5, lambda = sample(x = 10:1000, size = 1))
  ko.values <- rpois(5, lambda = sample(x = 10:1000, size = 1))

  data.matrix[i, ] <- c(wt.values, ko.values)
}
# Transpose so samples are in rows and genes in columns
PCAf <- t(data.matrix)

res.pca_NC <- prcomp(PCAf, scale = TRUE, center = FALSE)  # not centered
res.pca_C  <- prcomp(PCAf, scale = TRUE, center = TRUE)   # centered

fviz_pca_biplot(res.pca_NC)
fviz_pca_biplot(res.pca_C)

Not centered - generic PCA

Centered - generic PCA

  • I'm not familiar with that package, but thinking out loud, is it possible that a big difference between the result of prcomp before and after scaling means that you would want to use princomp? prcomp involves SVD on the covariance matrix and the second is an eigenvalue decomposition on the correlation matrix. The covariance is more general than the correlation, since the correlation is the covariance divided by the product of the standard deviations. With the covariance matrix, I think it needs to be standardized, but in the second case it is maybe not so necessary to standardize first – hachiko Apr 27 '22 at 22:33
  • Your questions cannot be answered generally. It is necessary to know more about the variables and the goals of your analysis. If differences in the scale/size of the variables are important to your analysis, it may be useful to center the variables but not standardize them. In that case the variables with the largest magnitude will dominate the analysis. If the variables are measured on different scales (length, mass, area, etc) or with different units (cm, km, light years), you should probably center and standardize the variables first. That ensures that each variable is treated equally. – dcarlson Apr 27 '22 at 23:53
  • @hachiko thank you for your tip! To be honest, I hadn't tried the princomp method before posting, but I did after reading your response. In the end, the same thing still happens: there are samples plotted in every quadrant, but all of the variables' coordinates are positive for PC1 (using princomp; when I use prcomp they are all negative). So it still doesn't help me understand what is going on here. I have updated the question with some extra images and I hope it clarifies what exactly seems odd to me. Thank you in advance! – Igor Salerno Filgueiras Apr 28 '22 at 17:03
  • @dcarlson Thank you so much for your help. I updated the question with some information about the nature of my data, as you mentioned, and I hope it clarifies some doubts. If possible, could you take another look? Thank you – Igor Salerno Filgueiras Apr 28 '22 at 17:05
  • I guess the princomp method is an SVD on the original data, which is centered and maybe standardized, and prcomp is an eigenvalue decomposition on the square matrix that is either a correlation matrix or a covariance matrix. I just learned that from another poster; I'm learning this stuff also – hachiko Apr 28 '22 at 17:11
