I inherited a deep learning model from a former colleague. For some reason, the train/dev sets are missing.

I want to classify my dataset into 100 categories. The dataset is extremely imbalanced and contains tens of millions of records.

First, I ran the model and obtained predictions for the whole dataset.

Then I sampled 100 records per predicted category, which gave a test set of 10,000 records.

Next, I labeled the ground truth for each record in the test set, calculated precision, recall, and F1 for each category, and computed the micro- and macro-averaged F1.

How can I estimate accuracy or other metrics on the whole dataset? Is it correct to use the weighted sum of each category's precision, where the weight is that category's share of the predictions on the whole dataset?

Since the distribution of predicted categories is not the same as the distribution of true categories, I suspect the weighted approach does not work. Can anyone explain?

ken wang

1 Answer

The problem with a weighted average is that if your classifier performs well on the majority class but poorly on the minority classes (which is the typical scenario), the poor performance will not be reflected in the score.
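To make this concrete, here is a small sketch with made-up numbers (three hypothetical categories rather than 100; the shares and precisions are assumptions for illustration, not values from the question). It compares the prediction-weighted sum of per-category precision with the plain unweighted mean:

```python
# Hypothetical share of predictions per category on the whole dataset (sums to 1).
pred_share = {"major": 0.90, "minor_1": 0.07, "minor_2": 0.03}

# Hypothetical per-category precision measured on the labeled sample.
precision = {"major": 0.95, "minor_1": 0.40, "minor_2": 0.30}

# Weighted sum: each category's precision times its share of the predictions.
weighted = sum(pred_share[c] * precision[c] for c in pred_share)

# Unweighted (macro) mean: every category counts equally.
macro = sum(precision.values()) / len(precision)

print(f"weighted: {weighted:.3f}")
print(f"macro:    {macro:.3f}")
```

With these numbers the weighted figure stays close to the majority category's precision, while the macro mean is pulled down by the two weak minority categories.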

One recommended alternative is the balanced accuracy score (scikit-learn implements it as balanced_accuracy_score). It is the average of the per-class recall scores: for each class, it computes the fraction of that class's observations that were correctly classified, then averages these fractions across all classes. This will give you a sensible overall score to report.
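As a rough illustration of that definition, the sketch below computes balanced accuracy by hand in plain Python. The helper name `balanced_accuracy` and the toy labels are assumptions made up for this example; in practice scikit-learn's `balanced_accuracy_score(y_true, y_pred)` is the usual way to get this number.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that appear in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical toy labels: "a" is the majority class, "b" and "c" are rare.
y_true = ["a"] * 8 + ["b"] * 2 + ["c"] * 2
y_pred = ["a"] * 8 + ["b", "a"] + ["a", "a"]

# Plain accuracy, for comparison.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

print(f"accuracy:          {accuracy:.2f}")
print(f"balanced accuracy: {bal_acc:.2f}")
```

Here plain accuracy looks acceptable because the majority class is classified perfectly, while balanced accuracy is lower because the two rare classes are mostly missed.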

MaximeKan