-1

I am working with medical images (DICOM images) to classify them into three disease classes, but I don't have an equal number of training images for each class. Is it a valid approach to just copy and paste images from the under-represented classes until all classes are equal in number? If not, what would be a better way?

Mike
  • 21
  • 1
  • 4

1 Answer

3

You have class imbalance in the data, and it's common. Your solution is essentially oversampling, which is a known strategy. Rather than copying files by hand, I would use a formal solution such as np.random.choice or np.random.rand and implement a bootstrap (resampling with replacement). Alternatively, itertools.combinations is another approach.
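A minimal sketch of the bootstrap idea, assuming the labels fit in a NumPy array and that you resample indices rather than image files. The function name `oversample_indices` and the 15:15:70 toy labels are made up for illustration, and it uses the newer `np.random.default_rng().choice` instead of the legacy `np.random.choice`:

```python
import numpy as np

def oversample_indices(labels, rng):
    """Return indices in which every class appears as often as the largest class."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    picked = []
    for cls in classes:
        cls_idx = np.flatnonzero(labels == cls)
        # Keep every original sample, then top up by drawing WITH replacement.
        extra = rng.choice(cls_idx, size=target - len(cls_idx), replace=True)
        picked.append(np.concatenate([cls_idx, extra]))
    return np.concatenate(picked)

rng = np.random.default_rng(0)                         # fixed seed for reproducibility
labels = np.array(["X"] * 15 + ["Y"] * 15 + ["Z"] * 70)  # toy labels, not real data
idx = oversample_indices(labels, rng)
balanced_labels = labels[idx]
print(np.unique(balanced_labels, return_counts=True))
```

Use `idx` to index your image array or file list. It is usually safer to do this on the training split only, after the train/test split, so duplicated images cannot end up on both sides.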

Background: there are three common ways to handle it: undersampling, oversampling, and changing the performance metric.

Say you have a 30:30:40 imbalance for diseases X, Y and Z. Undersampling means randomly deleting the excess samples from Z until the classes are balanced.
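As a rough sketch of undersampling for the 30:30:40 case, assuming labels in a NumPy array (the helper name `undersample_indices` is hypothetical):

```python
import numpy as np

def undersample_indices(labels, rng):
    """Return sorted indices keeping only as many samples per class as the smallest class has."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.min()
    kept = [
        # replace=False: each retained sample is a distinct original sample.
        rng.choice(np.flatnonzero(labels == cls), size=target, replace=False)
        for cls in classes
    ]
    return np.sort(np.concatenate(kept))

rng = np.random.default_rng(0)
labels = np.array(["X"] * 30 + ["Y"] * 30 + ["Z"] * 40)  # toy labels, not real data
keep = undersample_indices(labels, rng)
print(np.unique(labels[keep], return_counts=True))
```

Here ten Z samples are dropped at random and X and Y are left untouched.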

If you have 15:15:70 for X, Y and Z, you might consider oversampling, i.e. resampling X and Y until they match Z. Personally I'm not a fan, but that's just my opinion.

Alternatively, you could simply use precision and recall as performance metrics rather than accuracy, and plot precision-recall curves, much as you would a ROC curve.
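To illustrate why accuracy can mislead here, a small sketch with made-up predictions: a degenerate model that always predicts the majority class Z on a 15:15:70 split. The helper `per_class_precision_recall` is written out by hand for illustration; in practice a library routine would do the same job.

```python
import numpy as np

def per_class_precision_recall(y_true, y_pred, classes):
    """One-vs-rest precision and recall for each class (0.0 when undefined)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    results = {}
    for cls in classes:
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        results[cls] = (float(precision), float(recall))
    return results

y_true = np.array(["X"] * 15 + ["Y"] * 15 + ["Z"] * 70)  # toy labels
y_pred = np.array(["Z"] * 100)                           # always predicts Z
accuracy = float(np.mean(y_true == y_pred))
scores = per_class_precision_recall(y_true, y_pred, ["X", "Y", "Z"])
print(accuracy, scores)
```

The accuracy looks acceptable at 70%, yet recall for X and Y is zero, which is exactly what the per-class numbers expose.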

The best solution of all is simply to collect more data, but this is usually not practical.


In my opinion undersampling is a very good solution, but it becomes a problem when you end up deleting very large amounts of data. You could, of course, work around this with replicates: draw a large number of different undersampled subsets and track your chosen metric across them until you are satisfied it is stable.
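A sketch of the replicate idea, under heavy assumptions: the data is a synthetic one-dimensional feature, and `evaluate` is a placeholder nearest-centroid classifier scored by macro-averaged recall. For real use you would swap in your own training and validation routine. All names here are hypothetical.

```python
import numpy as np

def balanced_subset(labels, rng):
    """Indices of one random undersampled subset with equal class counts."""
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.min()
    return np.concatenate([
        rng.choice(np.flatnonzero(labels == cls), size=target, replace=False)
        for cls in classes
    ])

def replicate_scores(labels, evaluate, n_replicates=50, seed=0):
    """Score n_replicates independently drawn balanced subsets."""
    rng = np.random.default_rng(seed)
    return np.array([evaluate(balanced_subset(labels, rng)) for _ in range(n_replicates)])

# Synthetic stand-in data: class c has a 1-D feature centred on c.
labels = np.array([0] * 15 + [1] * 15 + [2] * 70)
features = np.random.default_rng(1).normal(loc=labels, scale=1.0)

def evaluate(indices):
    # Placeholder model: fit class centroids on the subset only,
    # then report macro-averaged recall over all samples.
    sub_x, sub_y = features[indices], labels[indices]
    centroids = np.array([sub_x[sub_y == c].mean() for c in (0, 1, 2)])
    pred = np.abs(features[:, None] - centroids[None, :]).argmin(axis=1)
    return float(np.mean([np.mean(pred[labels == c] == c) for c in (0, 1, 2)]))

scores = replicate_scores(labels, evaluate, n_replicates=50)
running_mean = np.cumsum(scores) / np.arange(1, len(scores) + 1)
print(scores.mean(), scores.std(), running_mean[-5:])
```

If the running mean has flattened out and the spread across replicates is small enough for your purposes, the metric can be considered stable; otherwise add more replicates.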

M__
  • 614
  • 2
  • 10
  • 25