I have a pandas DataFrame with the following columns: [ID | Group | Account]. The dataset contains about 5-6 million rows, and I'm trying to prepare the data for some ML later.
The distribution of the data is roughly exponential: the number of accounts (about 500 classes) per group (about 80 classes) is linear once a logarithm is applied to the counts. I'd like to even out that distribution.
How could I randomly select rows so that each account is capped at a ceiling value, with the guarantee that every group is sampled at least a few times for every account?
I have tried various techniques, but I can't find one that suits my problem.
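
To make the goal concrete, here is a minimal sketch of the kind of sampling I mean. It is only one possible reading of the problem, not a tested solution for the full dataset. The function name `capped_sample` and the parameters `cap`, `min_per_pair` and `seed` are hypothetical names chosen for illustration. The idea: shuffle the rows once, mark the first `min_per_pair` rows of every (Account, Group) pair as guaranteed picks, then keep at most `cap` rows per account with the guaranteed rows taking priority.

```python
import numpy as np
import pandas as pd


def capped_sample(df, cap=1000, min_per_pair=3, seed=0):
    """Keep at most `cap` rows per Account, prioritising a few rows
    from every (Account, Group) pair. Hypothetical helper, for illustration."""
    rng = np.random.default_rng(seed)

    # Shuffle once, so "the first k rows" of any group is a random pick.
    shuffled = df.iloc[rng.permutation(len(df))]

    # Position of each row within its (Account, Group) pair, in shuffled order.
    pair_rank = shuffled.groupby(["Account", "Group"]).cumcount()

    # Rows that must be kept so each pair appears at least `min_per_pair` times
    # (or as often as it exists, if the pair has fewer rows than that).
    guaranteed = (pair_rank < min_per_pair).to_numpy()

    # Stable sort: guaranteed rows first, random order preserved otherwise.
    ordered = shuffled.assign(_filler=~guaranteed).sort_values(
        "_filler", kind="stable"
    )

    # Keep the first `cap` rows of each account (the ceiling is a hard limit).
    keep = (ordered.groupby("Account").cumcount() < cap).to_numpy()
    return ordered[keep].drop(columns="_filler")
```

In this sketch the ceiling wins over the guarantee: every pair gets its `min_per_pair` rows only when `cap` is at least `min_per_pair` times the number of groups seen for that account. Calling `capped_sample(df, cap=1000, min_per_pair=3)` returns a new DataFrame with the same three columns, in shuffled order.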