Suppose I have a table with the following columns and many more rows:
Id | n_positive_class1 | n_positive_class2 | n_positive_class3 |
---|---|---|---|
1 | 0 | 10 | 4000 |
2 | 122 | 0 | 0 |
3 | 4 | 5234 | 0 |
I'd like to select the maximum number of rows (by Id) such that the three column sums over the chosen rows are as balanced as possible. Perfect balance is impossible, so the tolerance on the imbalance would probably be a parameter.
Is there an existing function that does this? If not, could you help me build one with the usual Python libraries (pandas, NumPy, SciPy, etc.)?
My use case is balancing a dataset that doesn't fit in memory, for training a machine learning model.
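
To make the goal concrete, here is a minimal sketch of one possible greedy heuristic. It is not an existing library function and not an optimal method. Everything in it is an assumption made for illustration: the names `imbalance` and `greedy_balanced_subset`, the `tolerance` and `id_col` parameters, and the definition of imbalance as `(max - min) / max` of the class totals. It also assumes that the table of per-Id counts fits in memory even though the underlying dataset does not.

```python
import numpy as np
import pandas as pd


def imbalance(totals):
    """Relative spread of the class totals (assumed metric): 0 is perfect balance."""
    totals = np.asarray(totals, dtype=float)
    peak = totals.max()
    if peak == 0:
        return 1.0  # treat an empty selection as maximally unbalanced
    return (peak - totals.min()) / peak


def greedy_balanced_subset(df, class_cols, tolerance=0.1, id_col="Id"):
    """Drop rows one at a time until the class totals are within `tolerance`.

    Heuristic only: each step removes the row whose removal lowers the
    imbalance the most. It does not guarantee the largest possible subset,
    and it stops early if no single removal improves the balance.
    """
    counts = df[class_cols].to_numpy(dtype=float)
    keep = np.ones(len(df), dtype=bool)
    totals = counts.sum(axis=0)

    while keep.sum() > 1 and imbalance(totals) > tolerance:
        candidates = np.flatnonzero(keep)
        # Class totals that would remain after removing each candidate row.
        remaining = totals - counts[candidates]
        peak = remaining.max(axis=1)
        spread = peak - remaining.min(axis=1)
        # Imbalance after each removal; all-zero totals count as worst case.
        scores = np.divide(spread, peak, out=np.ones_like(spread), where=peak > 0)
        if scores.min() >= imbalance(totals):
            break  # no single removal helps
        best = candidates[np.argmin(scores)]
        keep[best] = False
        totals -= counts[best]

    return df.loc[keep, id_col].tolist(), totals
```

The function returns the kept Ids and the resulting class totals; the caller should re-check `imbalance(totals)` because the loop can stop before the tolerance is reached. Each iteration scans all remaining rows, so the cost is quadratic in the number of rows in the worst case. An exact formulation would presumably be an integer program with one binary variable per row, which is a different technique from the greedy removal sketched here.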