I have tried to perform PCA on my dataset, but I don't know how to find the important features and eliminate the ones that are not selected. I have added a condition: if the data contains more than 10 features, perform PCA; otherwise, don't.

from sklearn.decomposition import PCA
from sklearn import preprocessing
columns = x.columns
def Perform_PCA(x): 
    no_of_col = len(x.columns)
    percent = 90
    my_num = int((percent/100)*no_of_col)
    if no_of_col >= 10:
        pca = PCA(n_components = my_num)
        x_new = pca.fit_transform(x)
        print("More than 10 columns found Performing PCA")
        return selected_var
    else:
        print("Less than 10 columns found no PCA performed")
        return x
        
        
x = Perform_PCA(x)
x
  • You should check this question: https://stackoverflow.com/questions/23294616/how-to-use-scikit-learn-pca-for-features-reduction-and-know-which-features-are-d?rq=1 – Robin Thibaut May 31 '22 at 08:26

2 Answers

In your current code you create my_num components, but only if you have 10 or more columns (the check is >= 10).

If you want to have a look and select the features yourself you could modify your code:

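 import pandas as pd  # needed for the DataFrame below

 pca = PCA()
 x_new = pca.fit_transform(x)
 # Fraction of the total variance explained by each principal component
 explained_variance = pca.explained_variance_ratio_
 print(explained_variance)
 # Rows are components, columns are the original features
 print(pd.DataFrame(pca.components_, columns=x.columns))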

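This gives you the explained variance ratio of every principal component (not of every original feature), and the printed table shows how strongly each original column weighs on each component. From there you can decide how many components to keep.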
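
As a rough sketch of how that inspection could go, the snippet below finds the smallest number of components whose cumulative explained variance reaches 90%, then lists the original column with the largest absolute weight on each of those components. The random DataFrame and the names `cumulative`, `n_keep`, `loadings` and `top_feature` are placeholders invented for this illustration, and the 90% bar is simply the figure from the question.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-in for the asker's DataFrame `x`: 12 random numeric columns.
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.normal(size=(200, 12)),
                 columns=[f"feat_{i}" for i in range(12)])

pca = PCA().fit(x)

# Cumulative share of the variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest k whose cumulative share reaches 90%.
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)

# Rows are components, columns are the original features.
loadings = pd.DataFrame(pca.components_, columns=x.columns)
# For each kept component, the original column with the largest absolute weight.
top_feature = loadings.iloc[:n_keep].abs().idxmax(axis=1)

print(n_keep)
print(top_feature)
```

Note that the largest weight only tells you which column dominates a component; every other column still contributes to it.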

Fabian
  • So how do I get the names of the columns that are selected after PCA? That is what I wanted to know. – Dharambir Maht0 May 31 '22 at 08:35
  • Generally, every feature is "selected", but each one influences the PCA components with a different weight. I adapted the code so you can see which of your columns contribute the most to each component of your PCA. Also take a look at @RobinThibaut's answer, because you might have a variable-naming error ;) – Fabian May 31 '22 at 08:41

I will first review your function:

from sklearn.decomposition import PCA
from sklearn import preprocessing
columns = x.columns
def Perform_PCA(x): 
    no_of_col = len(x.columns) 
    percent = 90 
    my_num = int((percent/100)*no_of_col)
    if no_of_col >= 10:
          pca = PCA(n_components = my_num)
          x_new = pca.fit_transform(x)
          print("More than 10 columns found Performing PCA")
          return selected_var
    else:
          print("Less than 10 columns found no PCA performed")
          return x

You are performing PCA only if there are ten or more columns, but in that branch your function returns selected_var, which is never defined, so the call fails with a NameError.

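Also, PCA does not "select features": it transforms the input data into a lower-dimensional representation whose columns are principal components, each a combination of all the original columns. To get that reduced representation, use pca.fit_transform(x) (or pca.transform(x) on an already fitted PCA); none of the output columns corresponds to a single original column.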
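
To make that concrete, here is a small check on made-up data (the array `data` and the names `reduced`, `manual` and `same` are assumptions for this example only). It rebuilds the PCA output by hand from pca.mean_ and pca.components_, which shows that each output column is a weighted sum of all the centered input columns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 rows, 5 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

# Project the centered data onto the component vectors by hand.
# Every output column mixes ALL five input columns.
manual = (data - pca.mean_) @ pca.components_.T
same = np.allclose(reduced, manual)

print(reduced.shape, same)
```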

Here is your code modified (it would be possible to optimise it further, but I tried to change it as little as possible):

from sklearn.decomposition import PCA
from sklearn import preprocessing
columns = x.columns
def Perform_PCA(x):
    no_of_col = len(x.columns)
    percent = 90
    # Keep 90% of the original number of columns as components
    my_num = int((percent/100)*no_of_col)

    if no_of_col >= 10:
        pca = PCA(n_components = my_num)
        x_new = pca.fit_transform(x)
        print("More than 10 columns found Performing PCA")
        # x_new is a NumPy array of principal components, not original columns
        return x_new
    else:
        print("Less than 10 columns found no PCA performed")
        return x
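
If the 90 in the function was meant as "keep 90% of the variance" rather than "keep 90% of the column count", one possible variant is sketched below. This is an assumption about the intent, not something the question states. It relies on scikit-learn accepting a float between 0 and 1 for n_components (with svd_solver="full"), and it standardises the columns first, which is a common choice but also an addition of mine. The function name `perform_pca_by_variance`, its parameters and the demo DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def perform_pca_by_variance(x, variance=0.90, min_cols=10):
    """Hypothetical variant: keep enough components to explain `variance`."""
    if x.shape[1] < min_cols:
        # Too few columns: return the input untouched.
        return x
    # Standardise so that no column dominates because of its scale (assumption).
    scaled = StandardScaler().fit_transform(x)
    # A float n_components asks for the smallest number of components
    # whose explained variance exceeds that fraction.
    pca = PCA(n_components=variance, svd_solver="full")
    reduced = pca.fit_transform(scaled)
    # Label the outputs as components, since they are not original columns.
    names = [f"PC{i + 1}" for i in range(reduced.shape[1])]
    return pd.DataFrame(reduced, columns=names, index=x.index)

# Made-up demo data: 150 rows, 12 numeric columns.
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(150, 12)),
                    columns=[f"feat_{i}" for i in range(12)])
result = perform_pca_by_variance(demo)
print(result.shape)
```

Returning a DataFrame with PC1, PC2, ... as column names makes it harder to mistake the result for a subset of the original columns.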

Hope this will help you.

Robin Thibaut