How to keep the earliest record for each category without considering the extra columns?

Question

Let's say I have a data table with 3 columns:

Category             Color              Date
triangle             red                2017-10-10
square               yellow             2017-11-10
triangle             blue               2017-02-10
circle               yellow             2017-07-10
circle               red                2017-09-10

I want to find the earliest date for each category and keep the whole row. So my desired output is:

Category             Color              Date
square               yellow             2017-11-10
triangle             blue               2017-02-10
circle               yellow             2017-07-10

I've looked through a couple of posts about how to do this:

Finding the min date in a Pandas DF row and create new Column

Pandas groupby category, rating, get top value from each category?

With Pandas in Python, select the highest value row for each group

and more.

A popular method is the groupby method:

df.groupby('Category').first().reset_index()

But when I use this method, it groups by Category yet still seems to keep both records for triangle, since they have two different colors.

Is there a better and more efficient way to do this?
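
For anyone reproducing this, here is a minimal sketch that rebuilds the sample table, assuming it lives in a DataFrame named df with Date stored as ISO-formatted strings (both are assumptions, the question does not say). On this data the unsorted groupby call returns one row per category, but for triangle it is the first row in the original order (red, 2017-10-10) rather than the earliest one.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Category": ["triangle", "square", "triangle", "circle", "circle"],
    "Color": ["red", "yellow", "blue", "yellow", "red"],
    "Date": ["2017-10-10", "2017-11-10", "2017-02-10", "2017-07-10", "2017-09-10"],
})

# first() takes the first entry of each group in the frame's current row
# order, so without sorting by Date it is not necessarily the earliest record.
unsorted_first = df.groupby("Category").first().reset_index()
print(unsorted_first)
```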

cs95 · Accepted Answer · 2017-11-20T22:52:16.347

You could use sort_values + drop_duplicates:

df.sort_values(['Date']).drop_duplicates('Category', keep='first')

   Category   Color        Date
2  triangle    blue  2017-02-10
3    circle  yellow  2017-07-10
1    square  yellow  2017-11-10
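
A self-contained version of the call above, as a sketch. The frame is rebuilt from the question's table, and the `pd.to_datetime` conversion is an added assumption: ISO-formatted strings happen to sort correctly as text, but parsing them makes the sort chronological for other date formats too.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Category": ["triangle", "square", "triangle", "circle", "circle"],
    "Color": ["red", "yellow", "blue", "yellow", "red"],
    "Date": ["2017-10-10", "2017-11-10", "2017-02-10", "2017-07-10", "2017-09-10"],
})

# Assumption: parse the strings so sorting is by date, not by text.
df["Date"] = pd.to_datetime(df["Date"])

# Sort ascending by Date, then keep only the first row seen per Category.
# Color plays no part in deciding what counts as a duplicate.
earliest = df.sort_values("Date").drop_duplicates("Category", keep="first")
print(earliest)
```

The original index labels (2, 3, 1) survive, which makes it easy to trace each kept row back to the source frame.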

If you want to preserve the original order of Category, you'll need to sort within a groupby call that uses sort=False:

df.groupby('Category', group_keys=False, sort=False)\
  .apply(lambda x: x.sort_values('Date'))\
  .drop_duplicates('Category', keep='first')

   Category   Color        Date
2  triangle    blue  2017-02-10
1    square  yellow  2017-11-10
3    circle  yellow  2017-07-10
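
An alternative sketch that gives the same category order without `apply`. This is a different technique from the call above (it uses `idxmin` on a `sort=False` groupby), and it may be worth considering because recent pandas releases have changed how `groupby.apply` handles the grouping column. The datetime conversion is again an assumption about how Date is stored.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Category": ["triangle", "square", "triangle", "circle", "circle"],
    "Color": ["red", "yellow", "blue", "yellow", "red"],
    "Date": pd.to_datetime(
        ["2017-10-10", "2017-11-10", "2017-02-10", "2017-07-10", "2017-09-10"]
    ),
})

# sort=False keeps the groups in order of first appearance
# (triangle, square, circle); idxmin gives the index label of the
# earliest Date within each group.
labels = df.groupby("Category", sort=False)["Date"].idxmin()

# Select the full rows, so Color comes along untouched.
earliest_in_order = df.loc[labels]
print(earliest_in_order)
```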

edited Nov 20 '17 at 22:52

answered Nov 20 '17 at 22:46


Should have refreshed the page... But I guess solutions are still not too similar. – Cleb Nov 20 '17 at 23:09
ah i was almost right! but i used "date" inside the duplicates... so i thought it was the wrong method. thanks so much for the help, AGAIN! – alwaysaskingquestions Nov 20 '17 at 23:20

Cleb · Answer 2 · 2017-11-20T23:36:53.833

The following should give you the desired output. Compared to what you posted, I first sort the values by date, since you want to keep the earliest date per category:

df.sort_values('Date').groupby('Category').first().reset_index()

That gives the desired output:

   Category   Color        Date
0    circle  yellow  2017-07-10
1    square  yellow  2017-11-10
2  triangle    blue  2017-02-10

EDIT

Thanks to @Wen in the comments, one can also make this call more efficient by passing as_index=False instead of calling reset_index:

df.sort_values('Date').groupby('Category', as_index=False).first()

which also gives

   Category   Color        Date
0    circle  yellow  2017-07-10
1    square  yellow  2017-11-10
2  triangle    blue  2017-02-10
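
One caveat worth knowing, sketched below on hypothetical data (not the question's table): `first()` returns the first non-null value per column, so if the earliest row has a missing Color, the result can combine values from different rows. `head(1)` keeps the actual earliest row, missing value included.

```python
import pandas as pd

# Hypothetical data: the earliest "triangle" row has no Color recorded.
df = pd.DataFrame({
    "Category": ["triangle", "triangle", "square"],
    "Color": [None, "red", "yellow"],
    "Date": pd.to_datetime(["2017-02-10", "2017-10-10", "2017-11-10"]),
})

by_date = df.sort_values("Date")

# first() skips nulls column by column: triangle gets the Date of the
# earliest row but the Color of the later one.
via_first = by_date.groupby("Category", as_index=False).first()

# head(1) returns the earliest row of each group exactly as stored.
via_head = by_date.groupby("Category").head(1)

print(via_first)
print(via_head)
```

On data without missing values, such as the question's table, both calls pick the same rows.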

edited Nov 20 '17 at 23:36

answered Nov 20 '17 at 23:06


thank you very much as well! this helps me understand a 2nd option. really helpful! choosing the first posted working answer. but i really appreciate your help still. – alwaysaskingquestions Nov 20 '17 at 23:27
adding `as_index`: `df.sort_values('Date').groupby('Category',as_index=False).first()` :-) – BENY Nov 20 '17 at 23:32
@Wen: That's nice, will add it! – Cleb Nov 20 '17 at 23:34

BENY · Answer 3 · 2017-11-20T23:36:44.267

head will return the original columns, with the original index:

df.sort_values(['Date']).groupby('Category').head(1)
Out[156]: 
   Category   Color        Date
2  triangle    blue  2017-02-10
3    circle  yellow  2017-07-10
1    square  yellow  2017-11-10

nth as well:

df.sort_values(['Date']).groupby('Category',as_index=False).nth(0)
Out[158]: 
   Category   Color        Date
2  triangle    blue  2017-02-10
3    circle  yellow  2017-07-10
1    square  yellow  2017-11-10

Or idxmin:

df.loc[df.groupby('Category').Date.idxmin()]
Out[166]: 
   Category   Color       Date
3    circle  yellow 2017-07-10
1    square  yellow 2017-11-10
2  triangle    blue 2017-02-10
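
A note on the `idxmin` variant, sketched on a hypothetical frame: `idxmin` returns index labels, so `df.loc[...]` only selects one row per category when the index is unique. If labels repeat (for example after concatenating frames), resetting the index first avoids picking up extra rows.

```python
import pandas as pd

# Hypothetical frame whose index labels repeat.
df = pd.DataFrame(
    {
        "Category": ["triangle", "triangle", "square"],
        "Color": ["red", "blue", "yellow"],
        "Date": pd.to_datetime(["2017-10-10", "2017-02-10", "2017-11-10"]),
    },
    index=[0, 0, 1],
)

# Label 0 is shared by both triangle rows, so .loc returns both of them.
ambiguous = df.loc[df.groupby("Category")["Date"].idxmin()]

# With a fresh unique index, each label identifies exactly one row.
clean = df.reset_index(drop=True)
earliest = clean.loc[clean.groupby("Category")["Date"].idxmin()]

print(len(ambiguous), len(earliest))
```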

edited Nov 20 '17 at 23:36

answered Nov 20 '17 at 23:23


thank you! but the other ones are earlier so i'll choose one from the earliest answer. but i really do appreciate your help still! – alwaysaskingquestions Nov 20 '17 at 23:26
@alwaysaskingquestions that is ok , I am here to help :-) happy coding – BENY Nov 20 '17 at 23:28
@alwaysaskingquestions provide you one more option :-) – BENY Nov 20 '17 at 23:38
