I am trying to perform a PCA analysis using the psych package in R.

I have two variables that I want to combine into one component representing standard of living:

  • slvpens: Standard of living of pensioners: 0 = Extremely bad, 10 = Extremely good.
  • slvuemp: Standard of living of unemployed: 0 = Extremely bad, 10 = Extremely good.

slvpens:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.  Standard Deviation
  0.000   3.000   5.000   4.587   6.000  10.000             2.28857

slvuemp:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.  Standard Deviation
  0.000   3.000   4.000   4.095   5.000  10.000            2.099822

Using the psych package, I perform the analysis:

(slv_pca <- ESS %>% prcomp(
  formula = ~ slvpens + slvuemp, # Selecting variables
  data = ., na.action = na.exclude)) # Exclude NAs

With the following results:

Standard deviations (1, .., p=2):
[1] 2.651352 1.611470

Rotation (n x k) = (2 x 2):
               PC1        PC2
slvpens -0.7699869  0.6380597
slvuemp -0.6380597 -0.7699869

Everything is good. However, if I z-standardize the variables:

(slv_pca <- ESS %>% prcomp(
  formula = ~ slvpens + slvuemp, # Selecting variables
  data = ., na.action = na.exclude, # Exclude NAs
  center = TRUE, scale = TRUE)) # Z-standardize

The picture changes: the loadings on PC1 and PC2 are now equal in absolute value. Do my two variables really contribute exactly the same to each component?

Standard deviations (1, .., p=2):
[1] 1.2058739 0.7388289

Rotation (n x k) = (2 x 2):
               PC1        PC2
slvpens -0.7071068  0.7071068
slvuemp -0.7071068 -0.7071068

What is going on here?

SnupSnurre
  • You could try using only `scale` and dropping the `center` option (just as an experiment! :) ). That way only the scaling is applied to the data. Alternatively, scale the data outside the function. – Earl Mascetti May 03 '20 at 10:33
  • @SlowLearning, Thanks for the suggestion! I tried removing the `center` option but no difference at all! – SnupSnurre May 03 '20 at 11:10
  • The problem is probably the correlation between the two series. I think the correlation will be almost 1. At this stage, I suggest you migrate the question to Cross Validated. – Earl Mascetti May 03 '20 at 11:21
  • Your data is already scaled, since it is ordinal. Remember, the purpose of scaling is to ensure the variables have the same range, which is already the case for your data, so why scale? – StupidWolf May 03 '20 at 11:26
  • For ordinal data, you can also consider other methods of analysis: https://stats.stackexchange.com/questions/215404/is-there-factor-analysis-or-pca-for-ordinal-or-binary-data – StupidWolf May 03 '20 at 11:27
  • @StupidWolf, why do you characterize it as ordinal and not interval? – SnupSnurre May 03 '20 at 12:13
  • sounds like ordinal to me? https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-numerical-variables/ .. scale 1 to 10 ? good to bad? – StupidWolf May 03 '20 at 12:15

1 Answer

The purpose of scaling and centering before PCA is to give your variables equal weight and to center your PC scores; see more here. Right now you have two variables that are already on the same scale.
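To make the "equal weight" point concrete, here is a minimal sketch of what happens with exactly two variables. It uses simulated data (the names `x1`, `x2`, `dat` and all the numbers are made up for illustration; this is not the ESS data). Once both variables are standardized, PCA works on the 2 x 2 correlation matrix, whose eigenvectors are always (1, 1)/sqrt(2) and (1, -1)/sqrt(2) up to sign, so the loadings come out as ±0.7071 whatever the data look like. Only the component variances, 1 + |r| and 1 - |r|, depend on the correlation r.

```r
# Simulated stand-ins for two 0-10 survey items (hypothetical, NOT the ESS data)
set.seed(1)
n <- 500
common <- rnorm(n)
x1 <- pmin(pmax(round(5 + 2 * common + rnorm(n, sd = 2)), 0), 10)
x2 <- pmin(pmax(round(4 + 1 * common + rnorm(n, sd = 2)), 0), 10)
dat <- data.frame(x1, x2)

pca_raw <- prcomp(dat)                 # centered only (the default)
pca_std <- prcomp(dat, scale. = TRUE)  # centered and scaled to unit variance

r <- cor(dat$x1, dat$x2)

# Unscaled: loadings depend on the variances of x1 and x2
pca_raw$rotation

# Scaled: every loading has absolute value 1/sqrt(2) = 0.7071068
abs(pca_std$rotation)

# Scaled: the component variances are 1 + |r| and 1 - |r|
pca_std$sdev^2
c(1 + abs(r), 1 - abs(r))
```

The standardized output in the question fits this pattern: the squared standard deviations (about 1.454 and 0.546) sum to 2, which corresponds to a correlation of roughly 0.45 between the two items. So the ±0.7071 loadings are a property of standardizing two variables rather than a sign that something went wrong.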

You don't need to scale; see my example below:

# Here I convert the iris columns into 1:10 ranks (10 equal-width bins per column)
scale_iris <- apply(iris[, 1:4], 2, function(i) as.numeric(cut(i, 10, labels = 1:10)))

# Side-by-side plots of the first two principal components
par(mfrow = c(1, 2))
plot(prcomp(iris[, 1:4], scale. = TRUE, center = TRUE)$x[, 1:2],
     col = factor(iris$Species), main = "Actual iris PCA")
plot(prcomp(scale_iris, center = TRUE)$x[, 1:2],
     col = factor(iris$Species), main = "Scale iris PCA")

If there is information in the ordinal variables, and they are on the same scale, it will be captured by the PCA.

Also of note: by default prcomp() centers the data (as it should) and does not scale it unless told to.
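One way to check these defaults yourself is to inspect the `center` and `scale` elements that prcomp() returns. The sketch below uses the built-in iris data purely as an example; the object names `p_default` and `p_explicit` are arbitrary.

```r
# Fit with no arguments, then with the defaults spelled out
p_default  <- prcomp(iris[, 1:4])
p_explicit <- prcomp(iris[, 1:4], center = TRUE, scale. = FALSE)

p_default$center  # the column means that were subtracted
p_default$scale   # FALSE, i.e. no scaling was applied

# Both calls give the same component standard deviations
all.equal(p_default$sdev, p_explicit$sdev)
```

This would also explain why removing the `center` option made no difference in the question: leaving it out is the same as `center = TRUE`.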

StupidWolf
  • Thanks. It makes perfect sense, and it works perfectly. However, centering (TRUE/FALSE) still makes a huge difference. According to your link, it is recommended to center the data. But why? – SnupSnurre May 03 '20 at 12:08
  • OK, sorry, I checked the code for prcomp again: by default it centers the data. So yes, you are right, you should center the data. This is so that you have PC scores that pass through the origin. Otherwise the scores will be weird, and you will have trouble using them. – StupidWolf May 03 '20 at 12:19
  • You can check the plot in this post: https://stats.stackexchange.com/questions/22329/how-does-centering-the-data-get-rid-of-the-intercept-in-regression-and-pca. In general I think of it like regression: if you don't center, you might have some issues with using the scores. – StupidWolf May 03 '20 at 12:23