
I'm trying to decompose my columns using PCA.

I'm having difficulty choosing n_components for the PCA class in scikit-learn (Python). This is what I did:

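from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
Z = sc.fit_transform(X)  # center and scale each column of X
pca = PCA(n_components=5)  # 5 is an arbitrary choice; this is the value I don't know how to pick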

Could you please explain how to choose it?

Mathilde
  • Hi, Stack Overflow is not a general problem-solving site. Please explain what you have already tried and provide the code of your attempt. We can then discuss any issues. – Yakov Dan Dec 16 '18 at 12:28
  • Post edited, thank you – Mathilde Dec 16 '18 at 12:41
  • Please note that your code is incorrect. `PCA(svd_solver='full, n_components = 5')` is a syntactic error – Yakov Dan Dec 16 '18 at 12:46
  • Yes, I'm sorry! I just edited it. My problem is how to know which number to set n_components to. Thank you so much – Mathilde Dec 16 '18 at 12:51
  • This really depends on your application. Why do you run PCA? How much of the variance in your input do you want to retain? – Yakov Dan Dec 16 '18 at 12:58
  • I don't know. I'm just using PCA to reduce the number of columns. I initially have 70 variables and I've been asked to do PCA, so I centered and scaled my initial data, and now I'm trying to set n_components correctly. Thank you for your response – Mathilde Dec 16 '18 at 13:03

1 Answer


No answer can tell you with certainty what the correct number of components is. It is application-specific.

However, there is a heuristic you can use: plot the cumulative explained variance ratio and choose the smallest number of components that "captures" at least 95% of the variance. In the following example, around 30 components capture about 95% of the variance.

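import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
pca = PCA().fit(load_digits().data)  # assumes the digits data is scikit-learn's load_digits()
plt.plot(np.cumsum(pca.explained_variance_ratio_))  # running total of variance explained
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')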

[Figure: cumulative explained variance plotted against the number of components]
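If you would rather not read the number off the plot, the same 95% rule can be applied numerically. The following is only a sketch under a few assumptions: it uses scikit-learn's load_digits() as a stand-in for your data, the 0.95 threshold is the heuristic value from above rather than a universal constant, and the variable names (cumulative, n_components_95, pca_95) are made up for illustration. It also shows that PCA accepts a float between 0 and 1 as n_components (with svd_solver='full'), in which case it keeps just enough components to explain that fraction of the variance.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption); replace with your own standardized matrix Z.
X = load_digits().data

# Fit with all components and accumulate the explained variance ratios.
full_pca = PCA(svd_solver='full').fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
# searchsorted returns a 0-based index, hence the +1.
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95% variance:", n_components_95)

# Alternative: let PCA pick the count by passing the target fraction.
pca_95 = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca_95.fit_transform(X)
print("components chosen by PCA:", pca_95.n_components_)
print("reduced shape:", X_reduced.shape)
```

In your case you would pass your standardized matrix Z instead of X. Whether 95% is the right target still depends on what you do with the reduced data afterwards.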

Farseer