Questions tagged [pca]

Principal component analysis (PCA) is a statistical technique for dimension reduction, often used alongside clustering or factor analysis. Given any number of explanatory variables, PCA constructs new variables (principal components) as linear combinations of the originals and ranks them by how much of the variation in the data each one explains. It is this property that allows PCA to be used for dimension reduction, i.e. to keep the few components that capture most of the variation in a large set of possible influences.

Overview

Mathematically, principal component analysis (PCA) amounts to an orthogonal transformation of possibly correlated variables (vectors) into uncorrelated variables called principal component vectors.
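
The orthogonal transformation described above can be sketched in a few lines of NumPy. This is only a minimal illustration on synthetic data, assuming an eigendecomposition of the sample covariance matrix. The function name `pca` and all variable names are hypothetical and are not taken from any library.

```python
import numpy as np


def pca(X, n_components):
    """Illustrative PCA via eigendecomposition of the sample covariance matrix."""
    # Center each column so the covariance is computed around the mean.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Keep the eigenvectors with the largest eigenvalues (largest variance first).
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]

    # Project the centered data onto the principal axes.
    scores = X_centered @ components
    return scores, components, eigvals[order]


# Synthetic data (an assumption for this sketch): two correlated columns and one independent column.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = np.column_stack([a, 2 * a + 0.1 * rng.normal(size=500), rng.normal(size=500)])

scores, components, variances = pca(X, n_components=3)

# The principal axes are orthonormal, and the projected variables are uncorrelated:
# their covariance matrix is diagonal, with the eigenvalues on the diagonal.
axes_are_orthonormal = np.allclose(components.T @ components, np.eye(3))
scores_are_uncorrelated = np.allclose(np.cov(scores, rowvar=False), np.diag(variances))
```

Library implementations typically use a singular value decomposition instead of forming the covariance matrix explicitly, so the signs of the components may differ from this sketch.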

Tag usage

Questions with this tag should be about implementation and programming problems, not about the statistical or theoretical properties of the technique. Consider whether your question might be better suited to Cross Validated, the Stack Exchange site for statistics, machine learning and data analysis.

In R, a software environment for statistical computing and graphics, the functions princomp and prcomp compute PCA.

2728 questions
14
votes
1 answer

How to convert spark DataFrame to RDD mllib LabeledPoints?

I tried to apply PCA to my data and then apply RandomForest to the transformed data. However, PCA.transform(data) gave me a DataFrame but I need a mllib LabeledPoints to feed my RandomForest. How can I do that? My code: import…
Tianyi Wang
  • 197
  • 1
  • 1
  • 6
14
votes
2 answers

Incremental PCA on big data

I just tried using the IncrementalPCA from sklearn.decomposition, but it threw a MemoryError just like the PCA and RandomizedPCA before. My problem is, that the matrix I am trying to load is too big to fit into RAM. Right now it is stored in an hdf5…
KrawallKurt
  • 449
  • 1
  • 5
  • 15
14
votes
2 answers

Scikit-Learn PCA

I am using input data from here (see Section 3.1). I am trying to reproduce their covariance matrix, eigenvalues, and eigenvectors using scikit-learn. However, I am unable to reproduce the results as presented in the data source. I've also seen this…
slaw
  • 6,591
  • 16
  • 56
  • 109
13
votes
5 answers

How to implement ZCA Whitening? Python

I'm trying to implement ZCA whitening and found some articles on how to do it, but they are a bit confusing. Can someone shed some light on this for me? Any tip or help is appreciated! Here are the articles I read…
user2136049
12
votes
1 answer

scikit-learn TruncatedSVD's explained variance ratio not in descending order

The TruncatedSVD's explained variance ratio is not in descending order, unlike sklearn's PCA. I looked at the source code and it seems they use different way of calculating the explained variance ratio: TruncatedSVD: U, Sigma, VT = randomized_svd(X,…
Xiangyu
  • 824
  • 9
  • 34
12
votes
1 answer

What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?

Suppose there is a matrix B of size 500*1000 and type double (here, 500 represents the number of observations and 1000 represents the number of features). sigma is the covariance matrix of B, and D is a diagonal matrix whose diagonal elements are…
Shawn
  • 333
  • 1
  • 6
  • 15
12
votes
1 answer

Sklearn.KMeans() : Get class centroid labels and reference to a dataset

Sci-Kit learn Kmeans and PCA dimensionality reduction I have a dataset, 2M rows by 7 columns, with different measurements of home power consumption with a date for each…
flow
  • 571
  • 1
  • 4
  • 16
12
votes
2 answers

Hotelling's T^2 scores in python

I applied PCA to a data set using matplotlib in Python. However, matplotlib does not provide t-squared scores like MATLAB does. Is there a way to compute Hotelling's T^2 score as in MATLAB? Thanks.
YC.Chui
  • 169
  • 1
  • 2
  • 7
12
votes
6 answers

Principal Component Analysis (PCA) on huge sparse dataset

I have about 1000 vectors x_i of dimension 50000, but they are very sparse; each has only about 50-100 nonzero elements. I want to do PCA on this dataset (in MATLAB) to reduce the unneeded extreme dimensionality of the data. Unfortunately, I don't…
Sean
  • 3,002
  • 1
  • 26
  • 32
12
votes
5 answers

PCA Implementation in Java

I need implementation of PCA in Java. I am interested in finding something that's well documented, practical and easy to use. Any recommendations?
Trup
  • 1,635
  • 13
  • 27
  • 40
11
votes
4 answers

How to whiten matrix in PCA

I'm working with Python and I've implemented PCA using this tutorial. Everything works great: I got the covariance, did a successful transform, and brought it back to the original dimensions, no problem. But how do I perform whitening? I tried…
mabounassif
  • 2,311
  • 6
  • 29
  • 46
11
votes
2 answers

pca.inverse_transform in sklearn

after fitting my data into X = my data pca = PCA(n_components=1) pca.fit(X) X_pca = pca.fit_transform(X) now X_pca has one dimension. When I perform inverse transformation by definition isn't it supposed to return to original data, that is X, 2-D…
haneulkim
  • 4,406
  • 9
  • 38
  • 80
11
votes
1 answer

Principal component analysis (PCA) of time series data: spatial and temporal pattern

Suppose I have yearly precipitation data for 100 stations from 1951 to 1980. In some papers, I find people apply PCA to the time series and then plot the spatial loadings map (with values from -1 to 1), and also plot the time series of the PCs. For …
Yang Yang
  • 858
  • 3
  • 26
  • 49
11
votes
2 answers

Python PCA on Matrix too large to fit into memory

I have a csv that is 100,000 rows x 27,000 columns that I am trying to do PCA on to produce a 100,000 rows X 300 columns matrix. The csv is 9GB large. Here is currently what I'm doing: from sklearn.decomposition import PCA as RandomizedPCA import…
mt88
  • 2,855
  • 8
  • 24
  • 42
11
votes
1 answer

PCA Analysis in PySpark

Looking at http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html. The examples seem to only contain Java and Scala. Does Spark MLlib support PCA analysis for Python? If so please point me to an example. If not, how to combine…
lapolonio
  • 1,107
  • 2
  • 14
  • 24