
Suppose I have a table with the following columns and many more rows:

Id  n_positive_class1  n_positive_class2  n_positive_class3
1   0                  10                 4000
2   122                0                  0
3   4                  5234               0

I'd like to select as many rows as possible (by Id) such that the sums of the three columns over the selected rows are as balanced as possible. Perfect balance is impossible, so the tolerance on the imbalance should probably be a parameter.
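
To make the goal concrete, here is a minimal greedy sketch of the kind of function I have in mind. It is only a heuristic and does not guarantee the maximum number of rows. The names `imbalance`, `greedy_balance` and `tol` are made up for illustration, the imbalance measure (relative gap between the largest and smallest class total) is just one possible choice, and I am assuming the per-Id count table itself fits in memory even though the full dataset does not.

```python
import numpy as np
import pandas as pd


def imbalance(totals):
    """Relative gap between the largest and smallest class total (0.0 = balanced).

    An all-zero total is treated as maximally unbalanced (1.0) so the
    heuristic below does not favor emptying the selection.
    """
    top = totals.max()
    return 1.0 if top == 0 else (top - totals.min()) / top


def greedy_balance(df, class_cols, tol=0.1):
    """Drop one row at a time until the class totals are within `tol`.

    Greedy heuristic (an assumption, not an exact method): at each step,
    remove the row whose removal leaves the smallest imbalance.
    """
    counts = df[class_cols].to_numpy(dtype=float)
    keep = np.ones(len(df), dtype=bool)
    totals = counts.sum(axis=0)

    while keep.sum() > 1 and imbalance(totals) > tol:
        # Class totals that would remain after removing each row in turn.
        candidate = totals - counts
        top = candidate.max(axis=1)
        spread = top - candidate.min(axis=1)
        score = np.divide(spread, top, out=np.ones_like(spread), where=top > 0)
        score[~keep] = np.inf  # rows already dropped cannot be dropped again
        drop = int(np.argmin(score))
        keep[drop] = False
        totals = totals - counts[drop]

    return df.loc[keep]
```

Each step costs one pass over the table, so the whole loop is quadratic in the number of rows in the worst case. The loop can also stop with a single row left without ever reaching `tol`, so the result should be checked with `imbalance` before use.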

Is there an existing function that does this? If not, could you help me build one with the usual Python libraries (pandas, numpy, scipy, etc.)?

My use case is balancing a dataset that doesn't fit in memory for training a machine learning model.

desertnaut
user11696358

0 Answers