
I have a DataFrame that I feed to an ML library in Python. The data is divided into 5 tasks: t1, t2, t3, t4, t5. The number of rows per task is currently uneven. To simplify things, here is an example:

task, someValue
t1,   XXX
t1,   XXX
t1,   XXX
t1,   XXX
t2,   XXX
t2,   XXX

In the case above, I want to remove random rows labeled "t1" until there are as many "t1" rows as "t2" rows. After the code is run, it should look like this:

task, someValue
t1,   XXX
t1,   XXX
t2,   XXX
t2,   XXX

What is the cleanest way to do this? I could of course use for loops and if conditions with random numbers, counting the occurrences on each iteration, but that solution would not be very elegant. Surely there must be a way to do this with DataFrame functions? So far, this is what I have:

def equalize_rows(df):
    # Number of rows for each task label
    t = df['task'].value_counts()
    # Size of the smallest task group
    minimum_occurrence = t.min()
Fupp2

1 Answer


You can calculate the number of rows in the smallest task group in your DataFrame, and then use groupby + head to keep the first N rows per task.

v = df['task'].value_counts().min()
df = df.groupby('task', as_index=False).head(v)

df
  task someValue
0   t1       XXX
1   t1       XXX
4   t2       XXX
5   t2       XXX
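
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The sample data and the name `balanced` are made up for illustration; only the `task` and `someValue` column names come from the question.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's example.
df = pd.DataFrame({
    "task": ["t1", "t1", "t1", "t1", "t2", "t2"],
    "someValue": ["a", "b", "c", "d", "e", "f"],
})

# Number of rows in the smallest task group.
v = df["task"].value_counts().min()

# Keep the first v rows of every task; the original row order and
# index labels are preserved.
balanced = df.groupby("task").head(v)
print(balanced)
```

Note that this always keeps the earliest rows of each task, so it is deterministic rather than random.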
cs95
  • Clear and simple solution, thanks! Sadly it is not random, but it should still work for me! :) – Fupp2 Apr 24 '18 at 14:23
  • @Fupp2 You can first do `df = df.sample(frac=1)` and then `df.groupby('task', sort=False, as_index=False).head(v)` if you want random rows – cs95 Apr 24 '18 at 14:25
  • @Fupp2 Sorry, you need to assign it back: `df = df.groupby('task', sort=False, as_index=False).head(v)` – cs95 Apr 24 '18 at 14:40
  • Ah, alright, thanks! However, I'm now thinking random may not be the best choice. Is it possible to do this by first removing every second row and then taking the head afterwards, or something like that? I'm afraid the data will become skewed – Fupp2 Apr 24 '18 at 14:46
  • @Fupp2 I recommend taking a look at this: https://stackoverflow.com/questions/36390406/pandas-sample-each-group-after-groupby it isn't exactly what you asked but may help – cs95 Apr 24 '18 at 14:47
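
The two variants discussed in the comments above can be sketched roughly as follows. This is only an illustration under some assumptions: the function names `downsample_random` and `downsample_evenly`, the `seed` parameter, and the sample data are hypothetical, and the evenly spaced variant is one possible reading of "removing every second row", not something confirmed in the thread.

```python
import pandas as pd

def downsample_random(df, col="task", seed=0):
    """Shuffle the rows, then keep the first v rows of each group."""
    v = df[col].value_counts().min()
    shuffled = df.sample(frac=1, random_state=seed)
    # sort=False keeps groups in order of appearance; sort_index()
    # restores the original row order afterwards.
    return shuffled.groupby(col, sort=False).head(v).sort_index()

def downsample_evenly(df, col="task"):
    """Keep v rows per group, spread evenly across each group."""
    counts = df[col].value_counts()
    v = counts.min()
    n = df[col].map(counts)            # size of the group each row is in
    pos = df.groupby(col).cumcount()   # 0-based position within the group
    # Split each group into v buckets and keep the first row of each bucket.
    bucket = (pos * v) // n
    prev_bucket = ((pos - 1) * v) // n
    return df[bucket != prev_bucket]

# Hypothetical sample data mirroring the question's example.
df = pd.DataFrame({
    "task": ["t1", "t1", "t1", "t1", "t2", "t2"],
    "someValue": ["a", "b", "c", "d", "e", "f"],
})

random_balanced = downsample_random(df)
even_balanced = downsample_evenly(df)
print(random_balanced)
print(even_balanced)
```

With the sample data, the evenly spaced variant keeps rows 0 and 2 of t1 instead of rows 0 and 1, which may reduce the skew from always dropping the tail of a group. Recent pandas versions also offer `groupby(...).sample(n=v)` as a more direct way to draw random rows per group.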