I can apply PCA on the classic Iris dataset to obtain the cumulative proportion per dimension:

library(tidyverse)
x <- iris[,1:4] %>% as.matrix()
pca <- prcomp(x)
summary(pca)

But I don't know how to do that with tidymodels. My code so far is:

library(tidymodels)
iris_vars <- iris %>% select(-Species)
iris_rec <- recipe(~., iris_vars) %>%
  step_pca(all_predictors())
iris_prep <- prep(iris_rec)
iris_tidy <- tidy(iris_prep,1)
iris_tidy
summary(iris_tidy)

I would like to obtain this with tidymodels:

Importance of components:
                          PC1     PC2    PC3     PC4
Standard deviation     2.0563 0.49262 0.2797 0.15439
Proportion of Variance 0.9246 0.05307 0.0171 0.00521
Cumulative Proportion  0.9246 0.97769 0.9948 1.00000

Any help will be greatly appreciated.

Manu

1 Answer

You can get the same results if you fit the same model. prcomp() defaults to center = TRUE, whereas step_pca() defaults to center = FALSE. In the following, I center and scale the variables in both approaches (since this is often recommended); on the recipe side, step_normalize() does both.

library("tidymodels")

x <- iris[,1:4] %>% as.matrix()
pca <- prcomp(x, scale. = TRUE)
summary(pca)
#> Importance of components:
#>                           PC1    PC2     PC3     PC4
#> Standard deviation     1.7084 0.9560 0.38309 0.14393
#> Proportion of Variance 0.7296 0.2285 0.03669 0.00518
#> Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

iris_rec <- recipe(Species ~ ., iris) %>%
    step_normalize(all_predictors()) %>% 
    step_pca(all_predictors())
iris_prep <- prep(iris_rec)

summary(iris_prep$steps[[2]]$res)
#> Importance of components:
#>                           PC1    PC2     PC3     PC4
#> Standard deviation     1.7084 0.9560 0.38309 0.14393
#> Proportion of Variance 0.7296 0.2285 0.03669 0.00518
#> Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

Created on 2020-05-29 by the reprex package (v0.3.0)
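
If the goal is to reproduce the exact numbers from the question (centered but not scaled, which is what prcomp() does by default), a minimal sketch is to swap step_normalize() for step_center(). The object names `iris_rec_centered`, `iris_prep_centered`, and `pca_centered` are only illustrative, and the sketch assumes the PCA step still stores the underlying prcomp fit in `$res`, as it does above.

```r
library(tidymodels)

# Center only (no scaling), mirroring prcomp()'s defaults
iris_rec_centered <- recipe(Species ~ ., data = iris) %>%
  step_center(all_predictors()) %>%
  step_pca(all_predictors())
iris_prep_centered <- prep(iris_rec_centered)

# Step 2 is the PCA step; $res is assumed to hold the prcomp fit
pca_centered <- iris_prep_centered$steps[[2]]$res
summary(pca_centered)
```

The standard deviations should then agree with `prcomp(iris[, 1:4])`, i.e. with the table shown in the question.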

hplieninger
  • Hello @hplieninger, thank you for the answer! I was struggling because I tried to use your solution but it gave me the error `subscript out of bounds`. Then I deduced that, since my code didn't have the normalize step, I had to use `iris_prep$steps[[1]]$res`, because the PCA is the one and only step applied. Just a question please: where did you find that information about summaries? Thank you! – Manu May 29 '20 at 14:21
  • I knew that the recipes package stores all information about a specific step in the respective `$steps` element. I `print`ed it and immediately saw the standard deviations, and then I was just lucky that I also tried out `summary()` on it. – hplieninger May 29 '20 at 14:53
  • Note that, in tidymodels, most models are fit using the parsnip package. The case of the PCA is a little bit special, because it is often used in recipes as a data pre-processing step. – hplieninger May 29 '20 at 14:53
  • You might also like my intro to tidymodels: https://hansjoerg.me/2020/02/09/tidymodels-for-machine-learning/ – hplieninger May 29 '20 at 14:54
  • Thank you very much @hplieninger, great article and superb explanation! I was following the Tidy Tuesday videos from Julia Silge, and some concepts I couldn't grasp there are explained in your blog. Have a great day! – Manu May 29 '20 at 16:13
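
The `subscript out of bounds` error mentioned in the comments comes from hard-coding the step position. A hedged sketch of two ways around that: look the PCA step up by its class, or give the step an `id` and ask `tidy()` for the variance summary. The names `iris_prep2`, `is_pca`, `pca_fit`, and `pca_var` are illustrative only, and `type = "variance"` assumes a recipes release recent enough to support it (the version used in the answer may predate it).

```r
library(tidymodels)

iris_prep2 <- recipe(Species ~ ., data = iris) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), id = "pca") %>%
  prep()

# Option 1: locate the PCA step by class rather than by position
is_pca <- vapply(iris_prep2$steps, inherits, logical(1), what = "step_pca")
pca_fit <- iris_prep2$steps[[which(is_pca)[1]]]$res
summary(pca_fit)

# Option 2: tidy the step by id; returns variance, cumulative variance,
# percent variance and cumulative percent variance per component
pca_var <- tidy(iris_prep2, id = "pca", type = "variance")
pca_var
```

Option 2 returns a tibble in long format, so the cumulative proportion can be filtered out with `dplyr::filter(pca_var, terms == "cumulative percent variance")`; note that it is reported in percent rather than as a proportion.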