
I have a dataset with ~13k features and I want to select the features that are contributing the most to the classification of a specific label.

I am using the sklearn.svm.LinearSVC class on single cell data.

The coef_ attribute should provide this information (as far as I understand), but when I exclude the top 10–100 features ranked by coef_, the accuracy / multi-class F1 score does not decrease.

Does somebody know how to extract this information based on a trained model?
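For reference, the exclusion experiment described above can be sketched like this: drop the features with the largest mean |coef_| across classes, retrain, and compare scores. This is a minimal sketch on iris (the aggregation over classes and the top-2 cutoff are my choices, not from the question); the ~13k-feature case works the same way.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
svc = LinearSVC(max_iter=10000).fit(X, y)

importance = np.abs(svc.coef_).mean(axis=0)      # aggregate weights over the 3 classes
top = np.argsort(-importance)[:2]                # top-2 features by mean |weight|
keep = np.setdiff1d(np.arange(X.shape[1]), top)  # indices to retain

reduced = LinearSVC(max_iter=10000).fit(X[:, keep], y)
print(svc.score(X, y), reduced.score(X[:, keep], y))
```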

I provided example code below that does the same thing with an open-source dataset:

from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np


X, y = load_iris(return_X_y=True, as_frame=True)

print(y.unique())  # [0 1 2] -> three classes

svc = LinearSVC(max_iter=10000)  # raised max_iter to avoid convergence warnings
svc.fit(X, y)

score = svc.score(X, y)

print(svc.coef_.shape)  # (3, 4): one coefficient vector per class (one-vs-rest)

# Plot each class's coefficients, sorted by descending weight
fig, axs = plt.subplots(1, 3, figsize=(15, 7))
for label, ax in enumerate(axs.flatten()):
    args = np.argsort(-svc.coef_[label])
    vals = svc.coef_[label][args]
    ax.bar(args, vals)
    ax.title.set_text(f"{label}")

plt.tight_layout()

if __name__ == '__main__':
    plt.show()

Output plot of the code
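As an aside, a more direct way to measure how much each feature contributes to the classification is permutation importance: shuffle one feature's values and see how much the score drops. A sketch using sklearn's `permutation_importance` (scoring on the training set here for brevity; a held-out split would be more rigorous):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
svc = LinearSVC(max_iter=10000).fit(X, y)

# Shuffle each feature n_repeats times and record the mean score drop
result = permutation_importance(svc, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(-result.importances_mean)  # most important feature first
print(ranking)
print(result.importances_mean[ranking])
```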

Adrian
  • If you remove informative features and don't see the accuracy score decreasing, then a) check for data leakage between your features and targets (one thing to check is the minimum number of features needed to maintain that same score) and b) try a different measure (precision, recall, F1 score) – G. Anderson Jul 01 '21 at 16:51

0 Answers