I have a dataset with ~13k features, and I want to select the features that contribute most to the classification of a specific label.
I am using the sklearn.svm.LinearSVC class on single-cell data.
As far as I understand, the coef_ attribute should provide this information, but when I exclude the top 10-100 features according to coef_, the accuracy / multi-class F1 score does not decrease.
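For context, the exclusion experiment looks roughly like this (a minimal sketch, not my actual pipeline: the function name and the top_k parameter are made up for the illustration, and I am assuming "top" means the largest absolute coefficients of the class of interest):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
import numpy as np

def score_without_top_features(X, y, label, top_k):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    svc = LinearSVC(max_iter=10000)  # larger max_iter only to avoid convergence warnings
    svc.fit(X_train, y_train)
    # rank this class's features by absolute coefficient size and drop the top_k
    ranked = np.argsort(-np.abs(svc.coef_[label]))
    keep = ranked[top_k:]
    # re-fit and re-score on the reduced feature matrix
    svc_reduced = LinearSVC(max_iter=10000)
    svc_reduced.fit(X_train[:, keep], y_train)
    return svc_reduced.score(X_test[:, keep], y_test)

X, y = load_iris(return_X_y=True)
print(score_without_top_features(X, y, label=0, top_k=2))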
Does somebody know how to extract this information from a trained model?
I have provided example code below that does the same thing, but with an open-source dataset!
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np
data = load_iris(return_X_y=True, as_frame=True)
print(data[1].unique()) # [0 1 2] -> three classes
svc = LinearSVC()
svc.fit(data[0], data[1])
print(svc.score(data[0], data[1]))  # accuracy on the training data
print(svc.coef_.shape)  # (3, 4): one weight vector per class, one weight per feature
fig, axs = plt.subplots(1, 3, figsize=(15, 7))
for label, ax in enumerate(axs.flatten()):
    # feature indices sorted by this class's coefficients, largest first
    args = np.argsort(-svc.coef_[label])
    vals = svc.coef_[label][args]
    # plot the bars in descending order, labelled with the original feature index
    ax.bar(range(len(args)), vals, tick_label=[str(a) for a in args])
    ax.set_title(f"class {label}")
plt.tight_layout()

if __name__ == '__main__':
    plt.show()
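Continuing from the example above (so data, svc and np are the same objects), this is how I would read the top features back out by name; ranking by absolute coefficient magnitude is just my assumption of what "contributing most" should mean:

feature_names = data[0].columns  # the iris DataFrame columns, available because as_frame=True was used
for label in range(svc.coef_.shape[0]):
    ranked = np.argsort(-np.abs(svc.coef_[label]))
    top = [(feature_names[i], float(svc.coef_[label][i])) for i in ranked[:3]]
    print(f"class {label}: {top}")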