
I am attempting a project with basketball data. I have a lot of historical data on player performance, with 54 features. I have just learned a bit about PCA and z-scores (still fuzzy on both).

Could I use PCA to perform feature selection on my features?

Thanks!

Song Mei

2 Answers


Well, PCA and z-scores may get you there, but there is a much better way to approach this kind of problem. Consider feature engineering, specifically feature selection: identify the features that are most strongly related to the target (dependent) variable, and remove the irrelevant or less important ones which do not contribute much to it, in order to achieve better overall accuracy for your model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# load a sample credit data set (substitute your own basketball data here)
df = pd.read_csv("https://rodeo-tutorials.s3.amazonaws.com/data/credit-data-trainingset.csv")
df.head()

from sklearn.ensemble import RandomForestClassifier

# the candidate feature columns
features = np.array(['revolving_utilization_of_unsecured_lines',
                     'age', 'number_of_time30-59_days_past_due_not_worse',
                     'debt_ratio', 'monthly_income', 'number_of_open_credit_lines_and_loans',
                     'number_of_times90_days_late', 'number_real_estate_loans_or_lines',
                     'number_of_time60-89_days_past_due_not_worse', 'number_of_dependents'])

# fit a random forest of the features against the target column
clf = RandomForestClassifier()
clf.fit(df[features], df['serious_dlqin2yrs'])

# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()

Just adapt that code to your own data set: point pd.read_csv at your data, list your 54 feature columns in features, and use your target column in place of 'serious_dlqin2yrs'.
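If you want to go a step further and actually drop the weak features, here is a minimal sketch using scikit-learn's SelectFromModel on the fitted classifier from above (the "median" threshold is just an illustrative choice, not part of the original code):

from sklearn.feature_selection import SelectFromModel

# keep only features whose importance exceeds the median importance
# ("median" is an illustrative threshold -- tune it for your data)
selector = SelectFromModel(clf, threshold="median", prefit=True)
X_reduced = selector.transform(df[features])

# names of the features that survived the cut
print(features[selector.get_support()])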

Here are a couple of links that further explain how feature engineering works.

https://github.com/WillKoehrsen/feature-selector/blob/master/Feature%20Selector%20Usage.ipynb

https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e

For your reference, here is a good link for understanding PCA better.

https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html
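To get a quick feel for PCA on your 54 features, here is a minimal sketch (the 95% variance target is an illustrative assumption; df and features are reused from the code above):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is scale-sensitive, so z-score the columns first
X_scaled = StandardScaler().fit_transform(df[features])

# keep enough components to explain ~95% of the variance (illustrative target)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                    # (n_samples, n_components_kept)
print(pca.explained_variance_ratio_)  # variance explained by each component

Keep in mind that PCA produces new composite components rather than a subset of your original columns, so strictly speaking it is dimensionality reduction, not feature selection.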

Also, here is a great link for understanding Z-Scores better.

Pandas - Compute z-score for all columns
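As a quick illustration of what that link covers, z-scoring every column in pandas is a one-liner (assuming df contains only numeric columns):

# z-score each column: subtract its mean, divide by its standard deviation
z = (df - df.mean()) / df.std()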

ASH

Well, it depends on the feature importances and the scores you get (such as accuracy, F1 score, or ROC AUC). If your model overfits, then you may remove the less important features.
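For example, here is a minimal sketch of one way to spot overfitting, comparing the training score against a cross-validated score (X and y are placeholders for your feature matrix and target):

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)

train_score = clf.score(X, y)                       # accuracy on the training data
cv_score = cross_val_score(clf, X, y, cv=5).mean()  # mean accuracy on held-out folds

# a training score far above the CV score is a sign of overfitting
print(train_score, cv_score)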

https://en.wikipedia.org/wiki/Curse_of_dimensionality

It doesn't have to be done with PCA. In addition to ASH's response, you can also use other tree-based models to find feature importances. Just do not forget to scale the features before modeling; if you don't scale, the importance results may be distorted.
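For example, a minimal sketch following that advice, scaling first and then reading importances from a different tree ensemble (ExtraTreesClassifier is my choice here; X and y are placeholders for your data):

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import ExtraTreesClassifier

# scale the features first, as suggested above
X_scaled = StandardScaler().fit_transform(X)

trees = ExtraTreesClassifier(n_estimators=100, random_state=0)
trees.fit(X_scaled, y)

# one importance score per original column, in input order
print(trees.feature_importances_)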

boozy