
I have a purely categorical data set with a very imbalanced class distribution (1:99).

I would like to train a model that computes, for each feature and each value of that feature, its importance for the prediction. In essence, I want to generate a dict-like object:

vocabulary = {
'user=12345': 0,
'user=67890': 1,
'age=30': 2,
'age=40': 3,
'geo=UK': 4,
'geo=DE': 5,
'geo=US': 6,
'geo=BR': 7}

and then attach an importance weight to each entry:

weights = [.1, .2, .15, .25, .1, .1, .2, .2]
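(For reference, a `'feature=value'`-to-index mapping of exactly this shape is what scikit-learn's `DictVectorizer` produces when one-hot encoding categorical dicts. A minimal sketch with made-up rows:)

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows of purely categorical data.
rows = [
    {"user": "12345", "age": "30", "geo": "UK"},
    {"user": "67890", "age": "40", "geo": "DE"},
]

# DictVectorizer one-hot encodes string values into 'feature=value' columns.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

# vocabulary_ maps each 'feature=value' string to its column index,
# matching the structure described above.
print(vec.vocabulary_)
```

A model trained on `X` can then be paired with `vec.vocabulary_` (or `vec.get_feature_names_out()`) to map per-column weights back to `'feature=value'` names.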

Which Python-based machine learning library should I use, and which algorithms within that library would let me extract the above output?

I have tried TensorFlow's linear regressor, scikit-learn's linear regressor, and GraphLab's boosted trees. The boosted trees seemed most promising, but I would like to use an open-source library if possible.

Thank you all very much in advance!

UPDATE:

GradientBoostingClassifier yields a score of 0.999137901985 due to the imbalanced classes.
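(With a 1:99 split, always predicting the majority class already scores ~0.99, so accuracy alone says little here. One common workaround is to upweight the rare class via `sample_weight`; a sketch on synthetic data, where the data and parameters are illustrative:)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic data mirroring the 1:99 imbalance in the question.
X, y = make_classification(n_samples=2000, weights=[0.99], random_state=0)

# A trivial majority-class predictor already achieves ~0.99 accuracy.
print("majority-class accuracy:", 1 - y.mean())

# 'balanced' gives each rare-class sample a proportionally larger weight.
w = compute_sample_weight("balanced", y)

gbm = GradientBoostingClassifier(random_state=0)
gbm.fit(X, y, sample_weight=w)
```

The fitted model's `feature_importances_` can then be inspected as usual; the weighting only changes how much each sample contributes to training.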

1 Answer

Without knowing much about your underlying problem, sklearn.ensemble.RandomForestClassifier and sklearn.ensemble.GradientBoostingClassifier both expose feature importances (via `feature_importances_`) and should be easy enough to use for most purposes. Here's a simple example on the Iris sample data:

In [79]: from sklearn.datasets import load_iris

In [80]: from sklearn.ensemble import GradientBoostingClassifier

In [81]: gbm = GradientBoostingClassifier()
    ...: gbm.fit(load_iris()["data"], load_iris()["target"])
Out[81]:
GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [82]: list(zip(load_iris()["feature_names"], gbm.feature_importances_))
Out[82]:
[('sepal length (cm)', 0.072135639528234),
 ('sepal width (cm)', 0.10858443031280439),
 ('petal length (cm)', 0.31074531613629014),
 ('petal width (cm)', 0.43520128068933822)]
  • Thank you, as mentioned in the question @Randy C I have tried this method - my model is predicting 99.99999% due to the severe imbalance in class labels. Any suggestions for getting around this? Also I have extracted the feature importance but how do I map them to my feature names? – dendog Nov 03 '16 at 18:21
  • You mentioned graphlab's GBM and sklearn's linear model, but this is sklearn's GBM, which is open source as you requested in the question. – Randy Nov 03 '16 at 18:32
  • For the imbalance, you may just need to set a different score cutoff instead of the default 50%. Things like ROC AUC will give you an indication if your model is rank ordering well, and then you can just pick your scoring threshold based on your acceptance of false positives vs. false negatives. – Randy Nov 03 '16 at 18:33
  • The feature names part is shown in the last line: just `zip` the list of feature names in the order they're passed to the model with the model's feature importances. – Randy Nov 03 '16 at 18:34
  • Hey @Randy C thanks for the help! I have managed to zip the arrays together. However I am unsure on how to set a different score cut off? – dendog Nov 03 '16 at 18:35
  • Using `gbm.predict_proba()` instead of `gbm.predict()` gives you the model's raw predictions (as `[p(0), p(1)]` pairs for 2 class classification) rather than the predicted label. You can then pick any arbitrary score threshold. – Randy Nov 03 '16 at 18:43
  • Ok, but as mentioned I am not looking at the actual model for prediction, I need to understand the importance of each feature for the prediction of a positive class. Is there no way to set class weight similar to LinearRegressor? I am manipulating each param and not able to move the loss value either way. – dendog Nov 03 '16 at 18:47
  • If your model rank orders well (which you can assess with something like ROC AUC), then you should be able to just use the feature importances as is. The accuracy measure you'd get out of `gbm.score()` really doesn't matter in this case for anything. – Randy Nov 03 '16 at 18:58
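(The thresholding idea from the comments can be sketched as follows, on synthetic imbalanced data; the data, cutoff value, and parameters are illustrative:)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Illustrative 1:99 imbalanced data.
X, y = make_classification(n_samples=2000, weights=[0.99], random_state=0)

gbm = GradientBoostingClassifier(random_state=0)
gbm.fit(X, y)

# Raw positive-class probabilities instead of hard 0/1 labels.
scores = gbm.predict_proba(X)[:, 1]

# ROC AUC measures whether the model rank-orders well,
# independent of any particular cutoff.
print("ROC AUC:", roc_auc_score(y, scores))

# Pick a cutoff below the default 0.5 to trade false negatives
# for false positives on the rare class.
threshold = 0.05  # illustrative
preds = (scores >= threshold).astype(int)
```

In practice the threshold would be chosen on held-out data, based on the acceptable false-positive/false-negative trade-off, while `feature_importances_` can be read off the fitted model regardless of the cutoff.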