Questions tagged [pca]

Principal component analysis (PCA) is a statistical technique for dimension reduction, often used in clustering or factor analysis. Given any number of explanatory or causal variables, PCA constructs linear combinations of them (the principal components) and ranks these by how much of the variation in the data they explain. It is this property that allows PCA to be used for dimension reduction, i.e. to summarize a large set of possible influences with a few components that capture most of the variation.

Overview

Mathematically, principal component analysis (PCA) amounts to an orthogonal transformation of possibly correlated variables (vectors) into uncorrelated variables called principal component vectors.
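
As a rough illustration of that statement, here is a minimal sketch for the two-variable case only, where the orthogonal transformation is a single rotation with a closed-form angle. The function names (`covariance`, `pca_2d`) and the toy data are made up for this example and do not come from any library; real code would normally call an SVD or eigendecomposition routine instead.

```python
import math

def covariance(a, b):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)

def pca_2d(xs, ys):
    """Hypothetical helper: rotate two variables onto their principal axes."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cx = [x - mx for x in xs]  # center each variable first
    cy = [y - my for y in ys]
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Angle of the first principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Orthogonal transformation (a rotation) of the centered data.
    pc1 = [c * a + s * b for a, b in zip(cx, cy)]
    pc2 = [-s * a + c * b for a, b in zip(cx, cy)]
    return (c, s), pc1, pc2

# Made-up, strongly correlated toy data (illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

axis, pc1, pc2 = pca_2d(xs, ys)
print("first principal axis:", axis)
print("covariance of the two components:", covariance(pc1, pc2))
print("variance of each component:", covariance(pc1, pc1), covariance(pc2, pc2))
```

The two component columns should come out uncorrelated up to floating-point error, with the first carrying at least as much variance as the second, while the total variance stays the same as in the original variables.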

Tag usage

Questions with this tag should be about implementation and programming problems, not about the statistical or theoretical properties of the technique. Consider whether your question might be better suited to Cross Validated, the Stack Exchange site for statistics, machine learning and data analysis.

In R, a language for statistical computing and graphics, the functions princomp and prcomp compute PCA.

2728 questions
27 votes, 3 answers

How to use scikit-learn PCA for features reduction and know which features are discarded

I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples. Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way: from…
gc5
26 votes, 2 answers

PCA on word2vec embeddings

I am trying to reproduce the results of this paper: https://arxiv.org/pdf/1607.06520.pdf Specifically this part: To identify the gender subspace, we took the ten gender pair difference vectors and computed its principal components (PCs). As Figure…
user2969402
26 votes, 4 answers

Using Numpy (np.linalg.svd) for Singular Value Decomposition

I'm reading Abdi & Williams (2010) "Principal Component Analysis", and I'm trying to redo the SVD to attain values for further PCA. The article states the following SVD: X = P D Q^t. I load my data in a np.array X: X = np.array(data) P, D, Q =…
dms_quant
25 votes, 3 answers

How to solve prcomp.default(): cannot rescale a constant/zero column to unit variance

I have a data set of 9 samples (rows) with 51608 variables (columns) and I keep getting the error whenever I try to scale it. This works fine: pca = prcomp(pca_data). However, pca = prcomp(pca_data, scale = T) gives: > Error in…
Brian Jackson
25 votes, 5 answers

Plot PCA loadings and loading in biplot in sklearn (like R's autoplot)

I saw this tutorial in R w/ autoplot. They plotted the loadings and loading labels: autoplot(prcomp(df), data = iris, colour = 'Species', loadings = TRUE, loadings.colour = 'blue', loadings.label = TRUE, loadings.label.size =…
O.rka
25 votes, 4 answers

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

I am reducing the dimensionality of a Spark DataFrame with PCA model with pyspark (using the spark ml library) as follows: pca = PCA(k=3, inputCol="features", outputCol="pca_features") model = pca.fit(data) where data is a Spark DataFrame with one…
nanounanue
24 votes, 2 answers

Performing PCA on large sparse matrix by using sklearn

I am trying to apply PCA on a huge sparse matrix. The following link says that sklearn's RandomizedPCA can handle sparse matrices in scipy sparse format: Apply PCA on very large sparse matrix. However, I always get an error. Can someone point out…
khassan
23 votes, 1 answer

Finding the dimension with highest variance using scikit-learn PCA

I need to use pca to identify the dimensions with the highest variance of a certain set of data. I'm using scikit-learn's pca to do it, but I can't identify from the output of the pca method what are the components of my data with the highest…
Alberto A
23 votes, 3 answers

Adding ellipses to a principal component analysis (PCA) plot

I am having trouble adding grouping variable ellipses on top of an individual site PCA factor plot which also includes PCA variable factor arrows. My code: prin_comp<-rda(data[,2:9], scale=TRUE) pca_scores<-scores(prin_comp) #sites=individual site…
Lew
23 votes, 5 answers

PCA first or normalization first?

When doing regression or classification, what is the correct (or better) way to preprocess the data? (1) Normalize the data -> PCA -> training; (2) PCA -> normalize PCA output -> training; (3) Normalize the data -> PCA -> normalize PCA output -> training. Which…
AlanS
22 votes, 4 answers

Difference between PCA (Principal Component Analysis) and Feature Selection

What is the difference between Principal Component Analysis (PCA) and Feature Selection in Machine Learning? Is PCA a means of feature selection?
AbhinavChoudhury
21 votes, 3 answers

How is the complexity of PCA O(min(p^3,n^3))?

I've been reading a paper on Sparse PCA, which is: http://stats.stanford.edu/~imj/WEBLIST/AsYetUnpub/sparse.pdf And it states that, if you have n data points, each represented with p features, then, the complexity of PCA is O(min(p^3,n^3)). Can…
GrowinMan
20 votes, 4 answers

In sklearn.decomposition.PCA, why are components_ negative?

I'm trying to follow along with Abdi & Williams - Principal Component Analysis (2010) and build principal components through SVD, using numpy.linalg.svd. When I display the components_ attribute from a fitted PCA with sklearn, they're of the exact…
Brad Solomon
19 votes, 2 answers

How to get "proportion of variance" vector from princomp in R

This should be very basic and I hope someone can help me. I ran a principal component analysis with the following call: pca <- princomp(....) summary(pca) Summary pca returns this description: PC1 PC2 PC3 Standard…
Neeraj Bhatnagar
19 votes, 2 answers

Matlab - PCA analysis and reconstruction of multi dimensional data

I have a large dataset of multidimensional data (132 dimensions). I am a beginner at data mining and I want to apply Principal Component Analysis using Matlab. However, I have seen that there are a lot of functions explained on the web…
Simon