I am working with a dataset of shape (2020, 1000) and trying to apply PCA to it. When I plot the cumulative explained variance against the number of components, I see a break (elbow) point at a cumulative variance of about 0.2, which corresponds to roughly 11 components. However, reaching an 85% variance threshold takes around 420 components. Moreover, I get the same PCA1 vs. PCA2 plot with n=11 (cumulative variance 0.2) as with n=420 (cumulative variance 0.85).

What doesn't make sense to me:

  1. The sources I have read say that choosing a low cumulative variance is not a good idea.

  2. If I have to choose a high cumulative variance, what exactly does that break (elbow) point mean?
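To make the two criteria concrete, here is a minimal sketch that computes both the per-component variance ratios (where an elbow would show up) and the number of components needed to reach a cumulative threshold. The synthetic data, the name `X_demo`, and the helper `components_for_threshold` are assumptions for illustration only; they are not my real dataset or code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the real data (assumption):
# 5 strong latent factors mixed into 50 features, plus unit noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + rng.normal(size=(500, 50))

pca_demo = PCA().fit(X_demo)
ratios = pca_demo.explained_variance_ratio_  # per-component share, sorted descending
cum_var = np.cumsum(ratios)                  # cumulative share, as a fraction in [0, 1]


def components_for_threshold(cum_var, threshold):
    """Hypothetical helper: smallest component count whose cumulative variance reaches threshold."""
    return int(np.searchsorted(cum_var, threshold) + 1)


n_85 = components_for_threshold(cum_var, 0.85)
print("components needed for 85% variance:", n_85)
print("variance ratios of the first 8 components:", np.round(ratios[:8], 3))
```

In this sketch the elbow is where the individual ratios stop dropping sharply, while `n_85` only says how many components are needed to accumulate 85% of the variance; the two numbers answer different questions and need not agree.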

[Plot: cumulative explained variance vs. number of components]

[Plot: PCA 1 vs. PCA 2 scatter]

Thanks in advance for the answers that will help me understand.

Here is the code. As you can see, I tried both 11 and 420 as the number of components.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# df is the already-loaded DataFrame: first column is the label, the rest are features
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

# Split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the data (fit on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit PCA with all components to inspect the full variance spectrum
pca = PCA().fit(X_train)

plt.rcParams["figure.figsize"] = (12, 6)

fig, ax = plt.subplots()
xi = np.arange(1, len(pca.explained_variance_ratio_) + 1, step=1)
# Named cum_var so it does not overwrite the label vector y
cum_var = np.cumsum(pca.explained_variance_ratio_)

plt.ylim(0.0, 1.1)
plt.plot(xi, cum_var, linestyle='-', color='b')  # , marker='.')

plt.xlabel('Number of Components')
plt.xticks(np.arange(0, len(pca.explained_variance_ratio_) + 1, step=50))
# Values are fractions in [0, 1], not percentages
plt.ylabel('Cumulative Explained Variance')
plt.title('The number of components needed to explain variance')

plt.axhline(y=0.85, color='r', linestyle='-')
plt.text(0.5, 0.85, '85% cut-off threshold', color='red', fontsize=16)

ax.grid(axis='x')
plt.show()

selected_num_components = 11  # 420

# Keep only the first selected_num_components columns of the full projection
X_train_pca = pca.transform(X_train)[:, :selected_num_components]
X_test_pca = pca.transform(X_test)[:, :selected_num_components]

# Visualize the transformed data using the first two components
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_train)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA - Transformed Data')
plt.colorbar(label='Class')
plt.show()
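
A small check that may explain the identical scatter plots: the code above fits PCA once with all components and then slices columns, so the first two columns do not depend on how many are kept. This sketch uses random data and made-up names (`X_demo`, `kept_small`, `kept_large`) as assumptions; it is not my real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data (assumption), only to illustrate the slicing behaviour.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 30))

# Fit once with all components, then keep different numbers of columns.
full_pca = PCA().fit(X_demo)
scores = full_pca.transform(X_demo)

kept_small = scores[:, :3]   # analogous to selected_num_components = 11
kept_large = scores[:, :20]  # analogous to selected_num_components = 420

# The first two columns come from the same projection in both cases.
same_first_two = np.allclose(kept_small[:, :2], kept_large[:, :2])
print("first two components identical:", same_first_two)
```

If that reading is right, the PC1 vs. PC2 plot can never change with `selected_num_components`; only the columns beyond the second differ between the two settings.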