Remove rows from Pandas dataframe where value only appears once

Question

I have a dataframe with monthly records for different IDs, and I need to do some analysis only on the IDs that have multiple months of records.

ID    Month       Metric1    Metric2
1     2018-01-01  4          3      
1     2018-02-01  3          2
2     2018-02-01  1          5
3     2018-01-01  4          2
3     2018-02-01  6          3
4     2018-01-01  3          1

How would I filter out the rows of ID that only appear once and keep those with multiple rows and get a result like

ID    Month       Metric1    Metric2
1     2018-01-01  4          3      
1     2018-02-01  3          2
3     2018-01-01  4          2
3     2018-02-01  6          3

I've looked at some other pages that mention using something like:

df = df[df.groupby('ID').ID.transform(len) > 1]

But I don't want to lose the metrics from each month by grouping.

score 11 · Accepted Answer · answered Apr 26 '19 at 16:25

11

Change the len to count

df[df.groupby('ID').ID.transform('count') > 1]
Out[589]: 
   ID       Month  Metric1  Metric2
0   1  2018-01-01        4        3
1   1  2018-02-01        3        2
3   3  2018-01-01        4        2
4   3  2018-02-01        6        3

answered Apr 26 '19 at 16:25

BENY

317,841
20
164
234

This answer is nice because it's flexible; we can also select on other frequencies or frequency ranges. – Niels Bom Jun 02 '21 at 13:23

score 6 · Answer 2 · answered Apr 26 '19 at 16:21

6

Try with pd.series.duplicated():

df1=df[df.ID.duplicated(keep=False)]
print(df1)

   ID       Month  Metric1  Metric2
0   1  2018-01-01        4        3
1   1  2018-02-01        3        2
3   3  2018-01-01        4        2
4   3  2018-02-01        6        3

answered Apr 26 '19 at 16:21

anky

74,114
11
41
70

1

I'd say this answer is a bit simpler/cleaner than @BENY's answer. – Niels Bom Jun 02 '21 at 13:22

piRSquared · Answer 3 · 2019-04-26T16:59:54.940

`filter`

I cannot vouche for the speed of this but this is what this API was intended for...

df.groupby('ID').filter(lambda d: len(d) > 1)

   ID       Month  Metric1  Metric2
0   1  2018-01-01        4        3
1   1  2018-02-01        3        2
3   3  2018-01-01        4        2
4   3  2018-02-01        6        3

Numpy'd version of @Wen-Ben's answer

u, i = np.unique(df.ID.values, return_inverse=True)

df[np.bincount(i)[i] > 1]

   ID       Month  Metric1  Metric2
0   1  2018-01-01        4        3
1   1  2018-02-01        3        2
3   3  2018-01-01        4        2
4   3  2018-02-01        6        3

Because I was curious...

s0 = set()
s1 = set()

for i in df.ID:
    if i in s0:
        s1.add(i)
    s0.add(i)

df[df.ID.map(s1.__contains__)]

   ID       Month  Metric1  Metric2
0   1  2018-01-01        4        3
1   1  2018-02-01        3        2
3   3  2018-01-01        4        2
4   3  2018-02-01        6        3

Remove rows from Pandas dataframe where value only appears once

3 Answers3

`filter`

Linked