
My goal is to rank the features of a supervised machine learning dataset by their contributions to the principal components, following this answer.

I set up an experiment in which I construct a dataset containing 3 informative, 3 redundant and 3 noise features, in that order. I then find the index of the largest component on each principal axis.

However, the ranking I get with this method is quite poor, and I don't know what mistake I have made. Many thanks for any help. Here is my code:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np

# Make a dataset with 3 informative, 3 redundant and 3 noise features, in that order
X, _ = make_classification(n_samples=20, n_features=9, n_informative=3,
                           n_redundant=3, random_state=0, shuffle=False)

cols = ['I_'+str(i) for i in range(3)]
cols += ['R_'+str(i) for i in range(3)]
cols += ['N_'+str(i) for i in range(3)]
dfX = pd.DataFrame(X, columns=cols)


# Find the feature with the largest absolute loading on each principal axis
model = PCA().fit(dfX)
_ = model.transform(dfX)

n_pcs= model.components_.shape[0]
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
most_important_names = [dfX.columns[most_important[i]] for i in range(n_pcs)]

rank = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

`rank` outputs:

{'PC0': 'R_1',
 'PC1': 'I_1',
 'PC2': 'N_1',
 'PC3': 'N_0',
 'PC4': 'N_2',
 'PC5': 'I_2',
 'PC6': 'R_1',
 'PC7': 'R_0',
 'PC8': 'R_2'}

I am expecting to see the informative features I_x ranked in the top 3.
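
For reference, a minimal diagnostic sketch (reusing `model` and `most_important_names` from the code above) that pairs each principal axis with its top-loading feature and the fraction of the variance that axis explains:

# Sketch: pair each principal axis with its top feature and its explained
# variance ratio; explained_variance_ratio_ is sorted from largest to smallest.
for i, (name, evr) in enumerate(zip(most_important_names,
                                    model.explained_variance_ratio_)):
    print('PC{}: {} (explains {:.1%} of variance)'.format(i, name, evr))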

Xer
  • Could you possibly print out what the column values contain? I think you are supposed to fit the PCA differently, meaning you have to fit-transform them individually instead of grouping them together. However, I'm not 100% sure on this. – Axois Jul 25 '19 at 03:41
  • Thanks for your comment, the dataset is created by sklearn's `make_classification` method, https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html. Sorry, I couldn't properly paste the dataframe here, but you can reproduce it with `random_state` set to 0. – Xer Jul 25 '19 at 08:18

1 Answer


PCA's ranking criterion is the variance of each column; if you would like a ranking, what you can do is output the per-column variance via VarianceThreshold. You can do that like this:

from sklearn.feature_selection import VarianceThreshold

# Fit on the feature matrix; variances_ then holds the variance of each column
selector = VarianceThreshold()
selector.fit_transform(dfX)
print(selector.variances_)

# outputs [1.57412087 1.08363799 1.11752334 0.58501874 2.2983772  0.2857617
# 1.09782539 0.98715471 0.93262548]

Here you can clearly see that the first 3 columns (I_0, I_1, I_2) have high variance, and thus make the best candidates for using PCA with.
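
If you want the ranking as an ordered list of column names rather than a raw array of variances, a small follow-up sketch (assuming the `selector` and `dfX` defined above) could look like this:

import numpy as np

# Sort the per-column variances in descending order and map the indices back
# to the DataFrame's column names.
order = np.argsort(selector.variances_)[::-1]
print(list(dfX.columns[order]))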

Axois
  • Thanks Axois, although the answer is not exactly what I want, you have offered an alternative way to approach the problem. – Xer Jul 26 '19 at 08:28