import pandas as pd
from sklearn import linear_model as lm
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV, SelectPercentile

a = pd.read_csv('NCAA_2003-2016_with_diff.csv')

logreg = lm.LogisticRegression()

rfecv = RFECV(estimator=logreg, cv=10, scoring='?')

The data set has 914 rows and 191 columns. For example:

x = df[['diff_dist','team1_log5','tpp','orp','tempo','efg','ftr','blk']]
y = df[['result']]

So there are many other candidate 'x' columns, and I am trying to select the most effective variables for predicting the result.

How can I write a for loop to do this?

Vivek Kumar
Hong
  • To clarify, is 'x' your set of features, and do you want to know how to do feature selection? Where does 'df' come from? – Cecilia Mar 10 '17 at 22:12
  • You need to describe more about the data and what you want to do. – Vivek Kumar Mar 11 '17 at 02:35
  • x are the features and y is the response variable. I want to select several features from the more than 100 in my data set based on the regression model; the metric could be mean squared error or F-score. Is that clearer now? – Hong Mar 11 '17 at 22:47
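
For illustration, the kind of loop described in the question could be sketched as a greedy forward selection: start with no features and, on each pass, add the one column that most improves a cross-validated score. This is only a sketch under assumptions, not the asker's method. The helper names `forward_select` and `make_cv_scorer` are hypothetical, the target column is assumed to be a binary `result`, and `scoring="f1"` is just one possible choice of metric.

```python
def forward_select(candidates, score_fn, max_features):
    """Greedy forward selection (hypothetical helper).

    On each pass, add the candidate column that gives the highest score
    when combined with the columns already selected. Stop early when no
    remaining column improves the score.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    for _ in range(min(max_features, len(remaining))):
        # Score every remaining column when added to the current selection.
        trials = [(score_fn(selected + [col]), col) for col in remaining]
        trial_score, trial_col = max(trials, key=lambda t: t[0])
        if trial_score <= best_score:
            break  # no remaining column improves the score
        selected.append(trial_col)
        remaining.remove(trial_col)
        best_score = trial_score
    return selected, best_score


def make_cv_scorer(df, target="result", scoring="f1", cv=10):
    """Build a score_fn backed by cross-validated logistic regression.

    Assumes df[target] holds a binary label; change `scoring` to any
    scikit-learn scoring string that suits the problem.
    """
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    y = df[target].values.ravel()

    def score_fn(columns):
        model = LogisticRegression(max_iter=1000)
        scores = cross_val_score(model, df[columns], y, cv=cv, scoring=scoring)
        return scores.mean()

    return score_fn


# Possible usage with the data frame from the question (not run here):
# features = [c for c in df.columns if c != "result"]
# selected, score = forward_select(features, make_cv_scorer(df), max_features=8)
```

Keeping the search loop separate from the scoring function means the metric or the model can be swapped without touching the loop. The `RFECV` class already used in the question works in the opposite direction, eliminating features recursively with cross-validation, so it may be a ready-made alternative to a hand-written loop.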

0 Answers