Questions tagged [pca]

Principal component analysis (PCA) is a statistical technique for dimension reduction, often used alongside clustering or factor analysis. Given a set of possibly correlated variables, PCA constructs new variables (the principal components) and ranks them by how much of the variation in the data each one explains. It is this property that allows PCA to be used for dimension reduction, i.e. to keep the few components that capture most of the variation in a large set of possible influences.

Overview

Mathematically, principal component analysis (PCA) amounts to an orthogonal transformation of possibly correlated variables (vectors) into uncorrelated variables called principal component vectors.
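The transformation described above can be illustrated with a minimal sketch in pure Python. It is not how `princomp`, `prcomp`, or any other library implements PCA: it assumes a small dense data set, uses power iteration with deflation as one possible way to find the leading eigenvectors of the covariance matrix, and all function names (`center`, `covariance`, `top_eigenpair`, `pca`) and the toy data are made up for illustration.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so every variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(mat, iters=500, seed=1):
    """Power iteration: leading eigenvalue and unit eigenvector of a
    symmetric positive semi-definite matrix (hypothetical helper)."""
    rng = random.Random(seed)
    d = len(mat)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero matrix: nothing left to extract
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigval = sum(v[i] * mat[i][j] * v[j] for i in range(d) for j in range(d))
    return eigval, v


def pca(rows, k):
    """Return the first k component directions, their variances,
    and the data projected onto those directions (the scores)."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components, variances = [], []
    for _ in range(k):
        lam, v = top_eigenpair(cov)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the variance already explained by v.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return components, variances, scores


# Toy data: two strongly correlated variables (illustrative only).
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.gauss(0.0, 1.0)
    data.append([x, 2.0 * x + rng.gauss(0.0, 0.5)])

components, variances, scores = pca(data, 2)
print("component directions:", components)
print("explained variances:", variances)
```

The two input columns are correlated by construction, while the two columns of `scores` have (numerically) zero covariance, and the component directions are orthogonal. For real workloads a library routine based on the SVD is the usual choice.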

Tag usage

Questions with this tag should be about implementation and programming problems, not about the statistical or theoretical properties of the technique. Consider whether your question might be better suited to Cross Validated, the Stack Exchange site for statistics, machine learning and data analysis.

In R, a language for statistical computing and graphics, the functions princomp and prcomp compute PCA.

2728 questions
19
votes
5 answers

Apply PCA on very large sparse matrix

I am doing a text classification task with R, and I obtain a document-term matrix with size 22490 by 120,000 (only 4 million non-zero entries, less than 1% entries). Now I want to reduce the dimensionality by utilizing PCA (Principal Component…
Ensom Hodder
18
votes
8 answers

What is the fastest way to calculate first two principal components in R?

I am using princomp in R to perform PCA. My data matrix is huge (10K x 10K with each value up to 4 decimal points). It takes ~3.5 hours and ~6.5 GB of Physical memory on a Xeon 2.27 GHz processor. Since I only want the first two components, is…
384X21
18
votes
2 answers

Using mca package in Python

I am trying to use the mca package to do multiple correspondence analysis in Python. I am a bit confused as to how to use it. With PCA I would expect to fit some data (i.e. find principal components for those data) and then later I would be able to…
Dan
17
votes
5 answers

How to find the closest 2 points in a 100 dimensional space with 500,000 points?

I have a database with 500,000 points in a 100 dimensional space, and I want to find the closest 2 points. How do I do it? Update: Space is Euclidean, Sorry. And thanks for all the answers. BTW this is not homework.
17
votes
3 answers

is it possible Apply PCA on any Text Classification?

I'm trying a classification with Python. I'm using the Naive Bayes MultinomialNB classifier for web pages (retrieving data from the web as text, then classifying this text: web classification). Now I'm trying to apply PCA on this data, but python is…
zer03
17
votes
1 answer

Pass PCA preprocessing arguments to train()

I'm trying to build a predictive model in caret using PCA as pre-processing. The pre-processing would be as follows: preProc <- preProcess(IL_train[,-1], method="pca", thresh = 0.8) Is it possible to pass the thresh argument directly to caret's…
Timm S.
16
votes
1 answer

How to compare predictive power of PCA and NMF

I would like to compare the output of an algorithm with different preprocessed data: NMF and PCA. In order to get somehow a comparable result, instead of choosing just the same number of components for each PCA and NMF, I would like to pick the…
16
votes
2 answers

Basic example for PCA with matplotlib

I am trying to do a simple principal component analysis with matplotlib.mlab.PCA, but with the attributes of the class I can't get a clean solution to my problem. Here's an example: Get some dummy data in 2D and start PCA: from matplotlib.mlab import…
Tyrax
16
votes
1 answer

R - how to make PCA biplot more readable

I have a set of observations with 23 variables. When I use prcomp and biplot to plot the results I run into several problems: the actual plot only occupies half of the frame (x < 0), but the plot is centered on 0, so half of space is wasted two…
Jakub Bochenski
15
votes
1 answer

PCA inverse transform manually

I am using scikit-learn. The nature of my application is such that I do the fitting offline, and then can only use the resulting coefficients online(on the fly), to manually calculate various objectives. The transform is simple, it is just data *…
Baron Yugovich
15
votes
2 answers

PCA with missing values in Python

I'm trying to do a PCA analysis on a masked array. From what I can tell, matplotlib.mlab.PCA doesn't work if the original 2D matrix has missing values. Does anyone have recommendations for doing a PCA with missing values in Python? Thanks.
Emily
15
votes
4 answers

Test significance of clusters on a PCA plot

Is it possible to test the significance of clustering between 2 known groups on a PCA plot? To test how close they are or the amount of spread (variance) and the amount of overlap between clusters etc.
mindlessgreen
15
votes
2 answers

PCA and KNN algorithm

I am using KNN to classify handwritten digits. I also now have implemented PCA to reduce the dimensionality. From 256 I went to 200. But I only notice like, ~0.10% loss of information. I deleted 56 dimension. Shouldn't the loss be bigger? Only when…
Test Test
14
votes
4 answers

PCA with several time series as features of one instance with sklearn

I want to apply PCA on a data set where I have 20 time series as features for one instance. I have some 1000 instances of this kind and I am looking for a way to reduce dimensionality. For every instance I have a pandas Data Frame, like: import…
Mina L.
14
votes
3 answers

PCA memory error in Sklearn: Alternative Dim Reduction?

I am trying to reduce the dimensionality of a very large matrix using PCA in Sklearn, but it produces a memory error (RAM required exceeds 128GB). I have already set copy=False and I'm using the less computationally expensive randomised PCA. Is…
Chris Parry