
I want to perform principal component analysis (PCA) for dimension reduction and data integration.

I have 3 features (variables) and 5 samples, as below. I want to integrate them into a 1-dimensional (1-feature) output by transforming them (computing the 1st PC). I want to use the transformed data for further statistical analysis, because I believe it displays the 'main' characteristics of the 3 input features.

I first wrote a test code in Python using scikit-learn, shown below. It is the simple case in which the values of the 3 features are all equivalent. In other words, I applied PCA to three identical vectors, [0, 1, 2, 1, 0].

Code

import numpy as np
from sklearn.decomposition import PCA

# 5 samples x 3 features; all three feature columns are identical.
samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]])

# Project the samples onto the first principal component.
pca = PCA(n_components=1)
pc1 = pca.fit_transform(samples)
print(pc1)

Output

[[-1.38564065]
[ 0.34641016]
[ 2.07846097]
[ 0.34641016]
[-1.38564065]]
  1. Is taking the 1st PC, i.e. reducing the dimension, a proper approach for data integration?

1-2. For example, suppose a 2-feature case where the features are [power rank, speed rank] and power has a roughly negative correlation with speed. I want to find the sample that has both 'high power' and 'high speed'. It is easy to decide that [power 1, speed 1] is better than [power 2, speed 2], but difficult for a case like [power 4, speed 2] vs. [power 3, speed 3]. So I want to apply PCA to the 2-dimensional 'power and speed' dataset, take the 1st PC, and then use the rank of the 1st PC (see the sketch after this list). Is this kind of approach still proper?

  2. In this case, I think the output should also be [0, 1, 2, 1, 0], the same as the input. But the output was [-1.38564065, 0.34641016, 2.07846097, 0.34641016, -1.38564065]. Is there a problem with the code, or is this the right answer?
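
To make question 1-2 concrete, here is a minimal sketch of the idea. The rank values are made up purely for illustration; lower rank means better.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical [power rank, speed rank] data; lower rank = better.
ranks = np.array([[1, 5], [2, 3], [4, 2], [3, 4], [5, 1]])

pca = PCA(n_components=1)
pc1 = pca.fit_transform(ranks).ravel()

# Order the samples by their score on the 1st PC.
# Note: the sign of a principal component is arbitrary, so the direction
# of the resulting ranking must be checked against pca.components_.
order = np.argsort(pc1)
print(pc1)
print(order)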
z991

2 Answers

  1. Yes. It is also called data projection (to a lower dimension).
  2. The resulting output is centered according to the training data and projected onto the unit-length first component, so the scale changes. The result is correct (see the check below).
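
A quick check of point 2: with three identical feature columns, the first component is the unit vector [1, 1, 1]/sqrt(3), so each score is sqrt(3) * (x - mean(x)). A minimal sketch verifying this against the question's data:

import numpy as np
from sklearn.decomposition import PCA

x = np.array([0, 1, 2, 1, 0], dtype=float)
samples = np.tile(x.reshape(-1, 1), 3)  # three identical feature columns

pc1 = PCA(n_components=1).fit_transform(samples).ravel()
expected = np.sqrt(3) * (x - x.mean())  # centering, then unit-norm projection

print(pc1)  # approx. [-1.386, 0.346, 2.078, 0.346, -1.386]
print(np.allclose(np.abs(pc1), np.abs(expected)))  # True (sign is arbitrary)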

With only 5 samples, I don't think it is wise to run any statistical method. And if you believe that your features are the same, just check that the correlation between dimensions is close to 1; then you can disregard the other dimensions.
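A minimal sketch of that correlation check, using the data from the question (for identical columns every pairwise correlation is exactly 1):

import numpy as np

samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]])

# Columns are the features; identical columns give a correlation matrix of all 1s.
print(np.corrcoef(samples, rowvar=False))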

igrinis
  • Thank you for your answer. In fact, those (identical) features and (small number of) samples were just an example, but I understand now that the result is correct. I would like to ask one more question, if you don't mind: I added an additional question with some detail (1-2). Is that also a proper way of doing 'data projection', as you mentioned? – z991 Oct 15 '17 at 17:44
  • In any case of strong correlation, no matter whether positive or negative, the features are essentially the same, because your covariance matrix will become degenerate (see the sketch after these comments). Watch this [short video](https://www.youtube.com/watch?v=kw9R0nD69OU); it will help you get a grip on it. – igrinis Oct 17 '17 at 10:02
  • Thank you for your comment. The video you suggested was quite impressive. – z991 Oct 17 '17 at 16:31
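
To illustrate the point in the comment above, here is a small sketch with made-up power/speed values: perfectly correlated features, positive or negative, produce a singular covariance matrix.

import numpy as np

power = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
speed = 6.0 - power  # perfect negative correlation with power

cov = np.cov(np.vstack([power, speed]))  # 2x2 covariance matrix
print(cov)
print(np.linalg.det(cov))  # ~0: the matrix is singular (degenerate)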

There is no need to use PCA for such a small dataset. And for PCA, your array should be scaled.

In any case, you have only 3 dimensions: you can plot the points and look at them with your own eyes, or you can calculate distances (apply some kind of Nearest Neighbors algorithm), as sketched below.
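A minimal sketch of the distance-based alternative, using scikit-learn's NearestNeighbors on the data from the question:

import numpy as np
from sklearn.neighbors import NearestNeighbors

samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]])

# For each sample, find the two closest points (including zero-distance matches).
nn = NearestNeighbors(n_neighbors=2).fit(samples)
distances, indices = nn.kneighbors(samples)
print(indices)    # column 0 is a zero-distance match (the sample itself or an exact duplicate)
print(distances)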

avchauzov
  • Thank you very much for your comment. The data consists of arrays of rankings, so I think the scaling is okay. I think PCA is one of the most common ways of finding a line (1st PC of 2D data) or plane (1st + 2nd PC of 3D data) which expresses the data with the lowest distance (error). Do you mean that other distance algorithms are more efficient? – z991 Oct 19 '17 at 04:40
  • Just one correction: for 2D data you'll get 2 lines (2 axes) and for 3D data, 3 axes. Then you see on which axes the variance is small and exclude them. I'm not sure about efficiency, but in my opinion clustering and taking the centers of the clusters may be a decent approach too (see the sketch after these comments). I can't say anything about the distance metric because it depends on your data. It's just my concern about PCA: when you have only 3 components, reducing them means losing a rather big amount of information. – avchauzov Oct 19 '17 at 05:47
  • Thank you very much for your answer. Sorry for my poor explanation: the line and plane I mentioned were the axes remaining after dropping the PC with the least variance. I agree that removing 1/3 or 1/2 of the features will cause a big loss of information, but I am not sure whether there are better options for dimension reduction (or data integration) that minimize the loss of information. Do you have any better suggestion for this task? – z991 Oct 19 '17 at 08:01
  • Let me ask you: why do you need to perform PCA? What do you mean by "want to use transformed data for further statistical analysis"? – avchauzov Oct 19 '17 at 09:50
  • I want to reduce the feature space to 1 dimension for easy prediction, although it doesn't need to be PCA. In the 2-dimensional example of my question 1-2, there are 2 features called 'power ranking' and 'speed ranking', but it is difficult to decide directly whether 'power 3 and speed 3' is better than 'power 4 and speed 2'. So I tried to transform the 2-dimensional data into 1 dimension with PCA (taking the 1st PC) and decide a 'total rank' from this linear 1st PC. – z991 Oct 19 '17 at 13:30
  • Ah, sorry, I don't have any good ideas in this case right now. – avchauzov Oct 20 '17 at 13:20
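
As a rough sketch of the clustering idea mentioned in the comments above (the rank values are hypothetical), one could cluster the 2-feature data with KMeans and inspect the cluster centers as representatives instead of a PCA projection:

import numpy as np
from sklearn.cluster import KMeans

ranks = np.array([[1, 5], [2, 3], [4, 2], [3, 4], [5, 1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ranks)
print(kmeans.cluster_centers_)  # one center per cluster
print(kmeans.labels_)           # cluster assignment for each sample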