I am working with a dataset of shape (2020, 1000) and I am trying to apply PCA to it. When I plot the cumulative explained variance against the number of components, I see a break (elbow) point at a cumulative variance of about 0.2, which corresponds to roughly 11 components. However, reaching an 85% variance threshold takes around 420 components. Moreover, with n=11 (cumulative variance 0.2) and n=420 (cumulative variance 0.85) I get the same PC1 vs. PC2 scatter plot.
The point that doesn't make sense to me is this:
The sources I have read say that choosing a low cumulative variance is not a good idea.
If I have to choose a high cumulative variance value, what exactly does that break (elbow) point mean?
Thanks in advance for the answers that will help me understand.
Here is the code. As you can see, I tried both 11 and 420 as the number of components.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# df is the dataset described above: first column is the label, the rest are features
X = df.iloc[:, 1:]
y = df.iloc[:, 0]
# Split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale the data (the scaler is fitted on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Fit PCA with all components to inspect the explained variance
pca = PCA().fit(X_train)
plt.rcParams["figure.figsize"] = (12, 6)
fig, ax = plt.subplots()
xi = np.arange(1, len(pca.explained_variance_ratio_) + 1, step=1)
# Named cum_var so that the label vector y is not overwritten
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.ylim(0.0, 1.1)
plt.plot(xi, cum_var, linestyle='-', color='b') #, marker='.')
plt.xlabel('Number of Components')
plt.xticks(np.arange(0, len(pca.explained_variance_ratio_) + 1, step=50))
# Values on this axis are fractions between 0 and 1, not percentages
plt.ylabel('Cumulative Variance (fraction)')
plt.title('The number of components needed to explain variance')
plt.axhline(y=0.85, color='r', linestyle='-')
plt.text(0.5, 0.85, '85% cut-off threshold', color='red', fontsize=16)
ax.grid(axis='x')
plt.show()
selected_num_components = 11 #420
# Keep only the first selected_num_components columns of the full projection
X_train_pca = pca.transform(X_train)[:, :selected_num_components]
X_test_pca = pca.transform(X_test)[:, :selected_num_components]
# Visualize the transformed data using the first two components
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_train)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA - Transformed Data')
plt.colorbar(label='Class')
plt.show()
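To make the two observations above concrete, here is a minimal sketch on synthetic data. It is not the dataset from the question, and it computes PCA with a plain NumPy SVD instead of scikit-learn's PCA purely so that it is self-contained. The names pca_scores, components_for_threshold and X_demo are hypothetical helpers introduced for illustration. It shows one way to read the component count for a threshold off the cumulative variance, and it checks that slicing the projection to its first k columns leaves columns 0 and 1 unchanged, which would explain why the PC1 vs. PC2 plot looks the same for any k >= 2.

```python
import numpy as np


def pca_scores(X):
    """Project centered data onto all principal axes using SVD.

    Returns the projected data (scores) and the explained variance ratio
    of each component, ordered from largest to smallest.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * S  # equivalent to Xc @ Vt.T
    var_ratio = S**2 / np.sum(S**2)
    return scores, var_ratio


def components_for_threshold(var_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cum_var = np.cumsum(var_ratio)
    n = int(np.searchsorted(cum_var, threshold)) + 1
    return min(n, len(var_ratio))  # guard against rounding when threshold is near 1.0


# Synthetic stand-in data (assumption: NOT the data from the question)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))

scores, var_ratio = pca_scores(X_demo)
n_85 = components_for_threshold(var_ratio, 0.85)

# Two different truncations of the same projection
small = scores[:, :5]
large = scores[:, :n_85]

# The first two columns are the same whichever truncation is used
same_first_two = np.array_equal(small[:, :2], large[:, :2])
print(n_85, same_first_two)
```

The number of components kept only decides how many columns are passed on to later steps; it does not alter the leading components themselves. With scikit-learn, the threshold-based count can also be requested directly by passing a float, for example PCA(n_components=0.85, svd_solver='full').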