Setting the optimal thresholds for a multiclass Random Forest and GBT model in PySpark (ML) on an imbalanced dataset

Question

Currently a Random Forest Classifier (from Spark ML) has been built on an imbalanced dataset with the following distribution:

+-------+----------------+
| score | n_observations |
+-------+----------------+
|     0 |         256741 |
|     1 |          13913 |
|     2 |           7632 |
|     3 |          15877 |
|     4 |           3289 |
|     5 |          11515 |
|     6 |           8555 |
|     7 |           2087 |
|     8 |          14226 |
|     9 |           6379 |
+-------+----------------+

As the output of this multiclass problem I get the following probability matrix. Each prob_k column gives the predicted probability that an observation belongs to class k (these are the normalized rawPrediction values from the Random Forest model).

+--------------------+-------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
| observation_number | label_score | prob_0  | prob_1  | prob_2  | prob_3  | prob_4  | prob_5  | prob_6  | prob_7  | prob_8  | prob_9  |
+--------------------+-------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
|                  0 |           0 |  0,812  |  0,039  |  0,024  |  0,049  |  0,007  |  0,024  |  0,019  |  0,002  |  0,016  |  0,008  |
|                  1 |           7 |  0,419  |  0,050  |  0,032  |  0,092  |  0,018  |  0,083  |  0,080  |  0,013  |  0,132  |  0,082  |
|                  2 |           0 |  0,862  |  0,027  |  0,017  |  0,043  |  0,004  |  0,016  |  0,015  |  0,001  |  0,009  |  0,006  |
|                  3 |           0 |  0,845  |  0,028  |  0,018  |  0,049  |  0,004  |  0,018  |  0,018  |  0,001  |  0,011  |  0,008  |
|                  4 |           6 |  0,445  |  0,051  |  0,034  |  0,095  |  0,017  |  0,080  |  0,079  |  0,012  |  0,115  |  0,073  |
+--------------------+-------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+

The predicted classes are highly skewed towards the majority classes (0 & 8). This is because each observation is assigned the class with the highest probability, as can be seen below. Applying undersampling or oversampling does not improve the accuracy shown in the confusion matrix.
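As a minimal sketch of the assignment rule described above, the snippet below applies a plain argmax to the five sample rows from the probability table. The names `rows`, `argmax` and `predictions` are illustrative only, and the decimal commas from the table are written as decimal points.

```python
# Probabilities copied from the table above (decimal commas written as points).
# Each entry: (observation_number, label_score, [prob_0 .. prob_9])
rows = [
    (0, 0, [0.812, 0.039, 0.024, 0.049, 0.007, 0.024, 0.019, 0.002, 0.016, 0.008]),
    (1, 7, [0.419, 0.050, 0.032, 0.092, 0.018, 0.083, 0.080, 0.013, 0.132, 0.082]),
    (2, 0, [0.862, 0.027, 0.017, 0.043, 0.004, 0.016, 0.015, 0.001, 0.009, 0.006]),
    (3, 0, [0.845, 0.028, 0.018, 0.049, 0.004, 0.018, 0.018, 0.001, 0.011, 0.008]),
    (4, 6, [0.445, 0.051, 0.034, 0.095, 0.017, 0.080, 0.079, 0.012, 0.115, 0.073]),
]


def argmax(values):
    """Return the index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Assign each observation the class with the highest probability.
predictions = [argmax(probs) for _, _, probs in rows]

for (obs, label, _), pred in zip(rows, predictions):
    print(f"observation {obs}: label={label} predicted={pred}")
```

All five rows end up in class 0, including the two whose labels are 7 and 6, because prob_0 stays the largest entry even when it drops to roughly 0.42.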

Confusion matrix from the output, also put into table form

However, judging by these probabilities, the ROC curves look quite promising (average AUC of 0.74). See the plot of the one-vs-rest ROC curve per class below:

ROC curves for the different classes (one-vs-rest)

I am unsure how to retrieve the optimal class from these probabilities. Looking at the AUC, there are multiple ways to get the optimal threshold. One solution would be to multiply each probability by the inverse of the class prior, whilst another would be to fit a linear regression. A third option could be the pairwise analysis explained in this paper (Approximating the multiclass ROC by pairwise analysis), but I am not sure how to apply their algorithm precisely.
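The first idea, multiplying by the inverse of the prior, can be sketched in plain Python as below. This is only an illustration under assumptions: the priors are estimated from the class counts in the distribution table, only three of the sample rows are used, and the names `counts`, `priors`, `sample` and `predict_with_thresholds` are hypothetical.

```python
# Class counts for score 0..9, copied from the distribution table.
counts = [256741, 13913, 7632, 15877, 3289, 11515, 8555, 2087, 14226, 6379]
total = sum(counts)

# Assumption: the empirical class frequencies serve as the priors.
priors = [c / total for c in counts]

# Three rows from the probability table: observation -> (label, probabilities).
sample = {
    0: (0, [0.812, 0.039, 0.024, 0.049, 0.007, 0.024, 0.019, 0.002, 0.016, 0.008]),
    1: (7, [0.419, 0.050, 0.032, 0.092, 0.018, 0.083, 0.080, 0.013, 0.132, 0.082]),
    4: (6, [0.445, 0.051, 0.034, 0.095, 0.017, 0.080, 0.079, 0.012, 0.115, 0.073]),
}


def predict_with_thresholds(probs, thresholds):
    """Pick the class with the largest p / t, i.e. p times the inverse threshold."""
    scaled = [p / t for p, t in zip(probs, thresholds)]
    return max(range(len(scaled)), key=lambda i: scaled[i])


# Using the priors as thresholds is the inverse-prior weighting from the text.
reweighted = {
    obs: predict_with_thresholds(probs, priors)
    for obs, (_, probs) in sample.items()
}

for obs, (label, _) in sample.items():
    print(f"observation {obs}: label={label} reweighted prediction={reweighted[obs]}")
```

On these rows, observation 0 stays in class 0 while observations 1 and 4 move to class 9, so the weighting counteracts the pull towards class 0 but does not by itself recover the true labels (7 and 6). Spark ML's probabilistic classifiers expose a `thresholds` param which, as far as its documentation describes, predicts the class with the largest p/t; passing the priors there (for example via `setThresholds`) should therefore apply the same rule inside Spark, though that is worth verifying against the Spark version in use.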

What would be the easiest Spark implementation to apply class weighting to the predicted probabilities of the Random Forest (or other tree classifiers)?
