
I have a table like this:

CustID  Purchase  Time
A       Item1     01/01/2011
B       Item2     01/01/2011   
C       Item1     01/02/2011   
A       Item2     03/01/2011   

I would like to select the rows whose CustID appears more than once in the table.
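For reference (assuming pandas, which the answers below use), the sample table can be rebuilt as a DataFrame:

```python
import pandas as pd

# Rebuild the sample table from the question
df = pd.DataFrame({
    'CustID':   ['A', 'B', 'C', 'A'],
    'Purchase': ['Item1', 'Item2', 'Item1', 'Item2'],
    'Time':     ['01/01/2011', '01/01/2011', '01/02/2011', '03/01/2011'],
})
print(df)
```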

piRSquared
Hai Vu

3 Answers


This could work:

counts = df['CustID'].value_counts()
df[df['CustID'].isin(counts.index[counts > 1])]

Result:

  CustID Purchase        Time
0      A    Item1  01/01/2011
3      A    Item2  03/01/2011
languitar
  • This was my approach, plus one. To make it more efficient, you can pass `sort=False` to `value_counts` – piRSquared Apr 11 '17 at 14:09
  • Nice answer! Your brace/parenth on the second line are backwards, though. I'd edit myself but want to avoid the risk of destroying your formatting on my phone :) – miradulo Apr 11 '17 at 19:00
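For completeness, the `sort=False` variant mentioned in the comment might look like this (a sketch on the sample data, rebuilt inline):

```python
import pandas as pd

df = pd.DataFrame({
    'CustID': ['A', 'B', 'C', 'A'],
    'Purchase': ['Item1', 'Item2', 'Item1', 'Item2'],
    'Time': ['01/01/2011', '01/01/2011', '01/02/2011', '03/01/2011'],
})

# sort=False skips sorting the counts by frequency, which we don't need here
counts = df['CustID'].value_counts(sort=False)
print(df[df['CustID'].isin(counts.index[counts > 1])])
```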
df[df['CustID'].duplicated(keep=False)]

This finds the rows in the DataFrame where duplicates exist in the `CustID` column. Passing `keep=False` tells `duplicated` to mark every duplicated row as True (rather than leaving the first or last occurrence unmarked):

  CustID Purchase        Time
0      A    Item1  01/01/2011
3      A    Item2  03/01/2011
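A quick sketch of how the `keep` argument changes which rows are flagged (sample data rebuilt inline):

```python
import pandas as pd

df = pd.DataFrame({
    'CustID': ['A', 'B', 'C', 'A'],
    'Purchase': ['Item1', 'Item2', 'Item1', 'Item2'],
    'Time': ['01/01/2011', '01/01/2011', '01/02/2011', '03/01/2011'],
})

# keep='first' leaves the first occurrence unmarked; keep=False marks them all
print(df['CustID'].duplicated(keep='first').tolist())  # [False, False, False, True]
print(df['CustID'].duplicated(keep=False).tolist())    # [True, False, False, True]
```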

EDIT

Looking at the docs for `duplicated`, it looks like you can also do:

df[df.duplicated('CustID', keep=False)]

Though this seems to be about 100 µs slower than the original (458 µs vs. 545 µs, based on the example DataFrame).
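The timing comparison can be reproduced with `timeit` along these lines (absolute numbers will vary by machine and pandas version):

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'CustID': ['A', 'B', 'C', 'A'],
    'Purchase': ['Item1', 'Item2', 'Item1', 'Item2'],
    'Time': ['01/01/2011', '01/01/2011', '01/02/2011', '03/01/2011'],
})

# Time each variant over 1000 runs and report microseconds per call
for label, stmt in [
    ('column.duplicated', lambda: df[df['CustID'].duplicated(keep=False)]),
    ('frame.duplicated', lambda: df[df.duplicated('CustID', keep=False)]),
]:
    t = timeit.timeit(stmt, number=1000)
    print(f'{label}: {t / 1000 * 1e6:.0f} µs per call')
```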

Whymarrh
bunji

Use `filter`:

df.groupby('CustID').filter(lambda x: len(x) > 1)

  CustID Purchase        Time
0      A    Item1  01/01/2011
3      A    Item2  03/01/2011
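A related sketch, not from the answer itself: the same selection via `groupby().transform('size')`, which avoids invoking a Python lambda once per group:

```python
import pandas as pd

df = pd.DataFrame({
    'CustID': ['A', 'B', 'C', 'A'],
    'Purchase': ['Item1', 'Item2', 'Item1', 'Item2'],
    'Time': ['01/01/2011', '01/01/2011', '01/02/2011', '03/01/2011'],
})

# Broadcast each group's size back onto its rows, then keep sizes > 1
mask = df.groupby('CustID')['CustID'].transform('size') > 1
print(df[mask])
```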
Whymarrh
piRSquared