
I need to understand what the scatterplot created from 2 principal components conveys.

I was working on the Boston housing dataset from 'sklearn.datasets'. I standardized the predictors, then used 'PCA' from 'sklearn.decomposition' to get 2 principal components and plotted them on a graph.
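For reference, a minimal sketch of that workflow (standardize, reduce to 2 PCs, scatter). Note that load_boston was removed in recent scikit-learn releases, so a random matrix with the Boston shape stands in for the predictors here:

```python
# Sketch of the described workflow: standardize, then 2-component PCA.
# A random matrix with the Boston shape (506 rows, 13 predictors) stands
# in for the real data, since load_boston is gone from newer scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))

X_std = StandardScaler().fit_transform(X)
pcs = PCA(n_components=2).fit_transform(X_std)
print(pcs.shape)  # (506, 2): one (PC1, PC2) point per observation
```

Plotting `pcs[:, 0]` against `pcs[:, 1]` with matplotlib then reproduces the scatterplot in question.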

Now all I want is help interpreting what the plot says, in simple language.

[scatterplot of the first two principal components]

2 Answers


Each principal component is a linear combination of all the features in your dataset. For example, if you have three variables A, B and C, then one possible principal component could be 0.5A + 0.25B + 0.25C. A data point with values [1, 2, 4] would then score 0.5*1 + 0.25*2 + 0.25*4 = 2 on that component.
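The arithmetic can be checked with a dot product (the 0.5/0.25/0.25 weights are the hypothetical loadings from the example above, not from any real dataset):

```python
import numpy as np

# Hypothetical loading vector for variables A, B, C from the example.
loadings = np.array([0.50, 0.25, 0.25])
point = np.array([1, 2, 4])

# The score of a data point on a component is the dot product
# of its feature values with the component's loadings.
score = loadings @ point
print(score)  # 0.5*1 + 0.25*2 + 0.25*4 = 2.0
```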

The first principal component is the combination of features that yields the highest variance in the data. Roughly, this means we tweak the weights (0.5, 0.25, 0.25) on each variable so that the variance of the resulting scores across all observations is maximized.
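A quick way to see this variance-maximizing property (a sketch on synthetic data; the data matrix and seed are arbitrary): the scores along PC1 have at least as much variance as the scores along any other unit direction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic 3-variable data with deliberately unequal spread per axis.
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

# Loadings of the first principal component, and the variance it captures.
pc1 = PCA(n_components=1).fit(Xc).components_[0]
pc1_var = (Xc @ pc1).var()

# Project onto many random unit directions: none captures more variance.
dirs = rng.normal(size=(1000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
max_rand_var = (Xc @ dirs.T).var(axis=0).max()
print(pc1_var >= max_rand_var)  # True
```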

In this plot of 2-D data, the first principal component (green) and the second (pink) are visualised as lines through the data.


The PCs are linear combinations of the features. You can order the PCs by the amount of variance they capture, from highest to lowest: PC1 contains the most variance, then PC2, and so on. For each PC it is therefore known exactly how much variance it explains. However, when you scatterplot the data in 2-D, as you did for the Boston housing dataset, it is hard to say how much, and which, features contributed to the PCs. This is where the biplot comes into play. A biplot shows each feature's contribution through the angle and length of its vector. With it, you not only know how much variance the top PCs explain, but also which features were most important.

Try the ‘pca’ library. It plots the explained variance and creates a biplot.

pip install pca

from pca import pca

# Initialize to reduce the data to the number of components that explains 95% of the variance.
model = pca(n_components=0.95)

# Or reduce the data towards 2 PCs
model = pca(n_components=2)

# Fit and transform your (standardized) data matrix X
results = model.fit_transform(X)

# Plot explained variance
fig, ax = model.plot()

# Scatter first 2 PCs
fig, ax = model.scatter()

# Make biplot
fig, ax = model.biplot(n_feat=4)
erdogant