
I have a binary classification task, where I fit the model using the XGBClassifier classifier and try to predict '1' and '0' on the test set. In this task I have very unbalanced data: a majority of '0' and a minority of '1' in the training data (and of course the same in the test set). My data looks like this:

           F1         F2        F3   ….   Target
    S1     2          4         5    ….     0
    S2     2.3        4.3       6.4         1
    …       …          …         ….         ..
  S4000    3           6         7          0

I used the following code to train the model and calculate the ROC AUC value:

  from xgboost import XGBClassifier
  from sklearn.metrics import roc_auc_score

  my_cls = XGBClassifier()
  X = mydata_train.drop(['target'], axis=1)     # drop needs axis=1 to drop a column
  y = mydata_train['target']
  x_tst = mydata_test.drop(['target'], axis=1)  # drop is a method: parentheses, not brackets
  y_tst = mydata_test['target']
  my_cls.fit(X, y)

  pred = my_cls.predict_proba(x_tst)[:, 1]
  auc_score = roc_auc_score(y_tst, pred)

The above code gives me a value as auc_score, but it seems this value is for one class, since it uses my_cls.predict_proba(x_tst)[:,1]; if I change it to my_cls.predict_proba(x_tst)[:,0], it gives me another AUC value. My first question is: how can I directly get the weighted average AUC? My second question is: how do I select the right cut point to build the confusion matrix, given the unbalanced data? By default the classifier uses 50% as the threshold to build the matrix, but since my data is very unbalanced it seems we need to select the right threshold. I need to count TP and FP, which is why I need this cut point.

If I use class weights to train the model, does that handle the problem (I mean, can I then use the default 50% cut point)? For example, something like this:

My_clss_weight=len(X) / (2 * np.bincount(y))

Then try to fit the model with this:

my_cls.fit(X, y, class_weight= My_clss_weight)

However, the above code my_cls.fit(X, y, class_weight= My_clss_weight) does not work with XGBClassifier and gives me an error. It works with LogisticRegression, but I want to apply it with XGBClassifier! Any idea how to handle these issues?

Spedo

1 Answer


To answer your first question, you can simply use the average='weighted' argument of the roc_auc_score function.

For example:

roc_auc_score(y_test, pred, average = 'weighted')
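A runnable sketch of that call on toy data (the labels and probabilities below are made up for illustration); note that pred here must be the 1-D probabilities of the positive class, a point that comes up again in the comments:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities of the positive class
y_tst = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
pred = np.array([0.1, 0.3, 0.2, 0.8, 0.4, 0.9, 0.15, 0.05, 0.7, 0.25])

auc = roc_auc_score(y_tst, pred, average="weighted")
print(auc)  # 1.0 here: every positive outranks every negative
```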

To answer the second half of your question: can you please elaborate a bit? I can help you with that.

Vatsal Gupta
  • I used the above code, but it gives the following error: 'bad input shape (3732, 2)'. I think this is because I used pred=my_cls.predict_proba(X_tst), which gives an array with two elements for each sample (e.g. 0.34, 0.64). – Spedo Mar 04 '20 at 14:29
  • In the case of binary models, you have to provide the probability of the positive class, i.e. pred[:, 1]. As I can see, you are providing the probability of both the classes i.e. positive and negative. – Vatsal Gupta Mar 04 '20 at 14:35
  • If I use **pred[:, 1]** it gives me exactly the same result that I get from **pred1=my_cls.predict_proba(x_tst)[:,1]** and then **roc_auc_score(y_tst,pred1)**. Do you think that is true? – Spedo Mar 04 '20 at 14:47
  • Yes, they are exactly the same things. :) So, you should be getting the same results. – Vatsal Gupta Mar 04 '20 at 14:51
  • Good, thanks. What about the threshold to build the confusion matrix? Do you have any idea how I can obtain the best threshold to build the confusion matrix? – Spedo Mar 04 '20 at 15:02
  • Yes, you can. You can use the probability of the positive class i.e. pred[:, 1]. Now, you can use those probabilities with the threshold of your choice to make a confusion matrix. Let' say you want a threshold of 0.3, then you can label samples with pred[:, 1] > 0.3 as 1 and less than 0.3 as 0. Now, you can see the confusion matrix – Vatsal Gupta Mar 04 '20 at 15:13
  • That's the point: how can I select the best *threshold of my choice*? For example, how can I know that 0.3 is the best cut point? – Spedo Mar 04 '20 at 15:33
  • For understanding the best threshold you might have to look at the specificity/sensitivity trade-off at various thresholds. The roc_curve function of sklearn gives out fpr, tpr and thresholds. You can calculate the sensitivity and specificity from the fpr and tpr values and plot the specificity vs. sensitivity graph. Now you can see what values you get from the graph; that way you can see which threshold suits you. – Vatsal Gupta Mar 04 '20 at 15:47
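The idea in the last comment can be sketched as follows on toy data; picking the threshold that maximises tpr - fpr is one common heuristic (Youden's J statistic), not the only valid choice:

```python
import numpy as np
from sklearn.metrics import roc_curve, confusion_matrix

# Toy imbalanced labels and positive-class probabilities (made up for illustration)
y_tst = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1])
pred = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.15, 0.4, 0.7, 0.55, 0.45, 0.25, 0.6])

fpr, tpr, thresholds = roc_curve(y_tst, pred)
# sensitivity = tpr, specificity = 1 - fpr
j = tpr - fpr                       # Youden's J at each candidate threshold
best = thresholds[np.argmax(j)]     # threshold with the best tpr/fpr trade-off

# Apply the chosen cut point instead of the default 0.5
labels = (pred >= best).astype(int)
cm = confusion_matrix(y_tst, labels)
print(best)
print(cm)
```

Here the classes separate perfectly, so the chosen threshold yields a diagonal confusion matrix; on real unbalanced data the curve lets you trade false positives against false negatives explicitly.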