How to find most relevant variables with PCA dimension reduction?

Question

I'm new to Python and working on a project, but I'm stuck on the coding part. I have one target variable and 23 candidate descriptors. My dataset is 11 (samples) × 23 (descriptors), plus a target dataset of 11 (samples) × 1 (target variable). How do I find the most relevant variables among these 23 descriptors using PCA dimensionality reduction?

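    from sklearn.decomposition import PCA

    # df2 is my dataset (descriptors plus the target in the last column)
    pca = PCA()
    pca.fit(df2)
    transformed = pca.transform(df2)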

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    steps = [('pca', PCA()), ('m', LogisticRegression())]
    model = Pipeline(steps=steps)

    from sklearn.preprocessing import MinMaxScaler
    steps = [('norm', MinMaxScaler()), ('pca', PCA()), ('m', LogisticRegression())]
    model = Pipeline(steps=steps)

    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=11, n_features=23, n_informative=5, n_redundant=18, random_state=7)
    print(X.shape, y.shape)

    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.pipeline import Pipeline
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    X, y = make_classification(n_samples=11, n_features=23, n_informative=5, n_redundant=18, random_state=7)
    steps = [('pca', PCA(n_components=4)), ('m', LogisticRegression())]
    model = Pipeline(steps=steps)
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    # report performance
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

[image: data]

[image] I'm supposed to produce this image, but I don't know how.

Get the variance through singular value decomposition - the highest variance PCs will be those you want. — Fried Noodles, Oct 04 '20 at 20:35
@FriedNoodles just answered your question, and if you wanna reduce your dataset using PCA, use sklearn package. — Kevin Choi, Oct 04 '20 at 20:41
@EricT Please upload your dataset and code. A (11,22) data frame is small enough. — Kevin Choi, Oct 04 '20 at 20:46
@KevinChoi Added. df2 is my dataset, which contains the target variable and the relevant descriptors. When I set the number of components to 5, the accuracy is about 85%, which is what I want given the lack of data. But I don't know how to find which 5 descriptors are the most relevant. The last column is the target variable. — EricT, Oct 04 '20 at 21:26
@FriedNoodles Should my dataset contain both target variable and relevant descriptors? — EricT, Oct 04 '20 at 22:04
I am not sure I understand your question, but what you are trying to do is select variables from your df2 without transforming df2, am I correct? — Kevin Choi, Oct 04 '20 at 22:08
@KevinChoi I'm not sure. I'm bad at coding. What I want to do is find the most relevant descriptors with PCA by calculating a variable importance parameter for each descriptor. I don't know how to transform the dataset. — EricT, Oct 04 '20 at 22:17
@KevinChoi I used the code from this site: https://machinelearningmastery.com/principal-components-analysis-for-dimensionality-reduction-in-python/ — EricT, Oct 04 '20 at 22:19
@KevinChoi I want to find the specific columns (descriptors) and then use PLS regression to model the target variable with these descriptors. The second image I uploaded shows the variable importance parameter for each descriptor. There are about 26 descriptors, and I'm supposed to select the 4 or 5 most relevant ones to model this target variable. — EricT, Oct 04 '20 at 22:23
@EricT There are a few things you should know. 1) PCA is a feature extraction method, not a feature selection method. You can reduce the dimensionality of df2 to 5, but the result would be projected values, not original columns. 2) If you want to use PCA, your features should be linearly independent. Your df2 has more variables than observations, which means its columns are not linearly independent. So you can't perform PCA on df2. That's probably why you are getting errors when running your code. — Kevin Choi, Oct 04 '20 at 22:34
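To illustrate the points about variance and feature extraction in the comments so far, here is a minimal sketch. It assumes a random 11 × 23 matrix as a stand-in for the descriptor columns of df2 (the real data is not shown) and hypothetical descriptor names `d0` to `d22`. It reads the variance explained by each component and then lists which original descriptors carry the largest absolute loadings on the first component. In this sketch scikit-learn's `PCA` does fit with fewer samples than features and returns at most min(n_samples, n_features) components. Loadings only describe how descriptors contribute to a component; they are not a relevance measure with respect to the target.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the 11 x 23 descriptor table (target excluded).
rng = np.random.default_rng(0)
X = rng.normal(size=(11, 23))
names = [f"d{i}" for i in range(X.shape[1])]  # hypothetical descriptor names

# Standardize so that no descriptor dominates because of its units.
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# Fraction of total variance captured by each component, largest first.
ratio = pca.explained_variance_ratio_

# Rows are components, columns are the original descriptors.
loadings = pca.components_

# Descriptors with the largest absolute loading on the first component.
top = np.argsort(np.abs(loadings[0]))[::-1][:5]
for j in top:
    print(names[j], round(float(loadings[0, j]), 3))
```

The same ranking can be repeated for the next few components, or weighted by `ratio`, but either choice is a heuristic rather than a standard selection rule.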
@EricT If you want to select the most relevant columns from df2, I recommend reading about the chi-square test for feature selection, which is one of the most basic feature selection methods. — Kevin Choi, Oct 04 '20 at 22:52

0 Answers