3

I want to ignore the rows is the occupation has less than 2 unique names:

name        value      occupation
   a           23      mechanic
   a           24      mechanic
   b           30      mechanic
   c           40      mechanic
   c           41      mechanic
   d           30      doctor
   d           20      doctor
   e           70      plumber
   e           71      plumber
   f           30      plumber
   g           50      tailor

I did:

df.groupby('ocuupation')['name'].nunique()
>>>>>>
occupation
mechanic   3
doctor     1
plumber    2
tailor     1
Name: name, dtype: int64

Is it possible to use something like df = df.drop(df[<some boolean condition>].index)?

Desired output:

name        value      occupation
   a           23      mechanic
   a           24      mechanic
   b           30      mechanic
   c           40      mechanic
   c           41      mechanic
   e           70      plumber
   e           71      plumber
   f           30      plumber
futuredataengineer
  • 442
  • 1
  • 3
  • 14

1 Answers1

6

Use GroupBy.transform with Series.ge for get values equal of greater like 2:

df = df[df.groupby('occupation')['name'].transform('nunique').ge(2)]
print (df)
  name  value occupation
0    a     23   mechanic
1    a     24   mechanic
2    b     30   mechanic
3    c     40   mechanic
4    c     41   mechanic
7    e     70    plumber
8    e     71    plumber
9    f     30    plumber

Your solution is filtered values of index in Series compared in Series.isin:

s = df.groupby('occupation')['name'].nunique()

df = df[df['occupation'].isin(s[s.ge(2)].index)]
print (df)
  name  value occupation
0    a     23   mechanic
1    a     24   mechanic
2    b     30   mechanic
3    c     40   mechanic
4    c     41   mechanic
7    e     70    plumber
8    e     71    plumber
9    f     30    plumber
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252