1

Let's say I have a data table with 3 columns:

Category             Color              Date
triangle             red                2017-10-10
square               yellow             2017-11-10
triangle             blue               2017-02-10
circle               yellow             2017-07-10
circle               red                2017-09-10

I want to find out the earliest date by each category. So my desired output is:

Category             Color              Date
square               yellow             2017-11-10
triangle             blue               2017-02-10
circle               yellow             2017-07-10

I've looked through a couple posts about how to do this:

Finding the min date in a Pandas DF row and create new Column

Pandas groupby category, rating, get top value from each category?

With Pandas in Python, select the highest value row for each group

and more.

A popular method is the groupby method:

df.groupby('Category').first().reset_index() 

But if I use this method, then it'll group by Category, but it'll keep both records for triangle since it has two different colors.

Is there a better and more efficient way to do this?

Cleb
  • 25,102
  • 20
  • 116
  • 151
alwaysaskingquestions
  • 1,595
  • 5
  • 22
  • 49

3 Answers3

3

You could use sort_values + drop_duplicates:

df.sort_values(['Date']).drop_duplicates('Category', keep='first')

   Category   Color        Date
2  triangle    blue  2017-02-10
3    circle  yellow  2017-07-10
1    square  yellow  2017-11-10

If you want to preserve the original order of Category, you'll need to sort on a groupby call:

df.groupby('Category', group_keys=False, sort=False)\
  .apply(lambda x: x.sort_values('Date'))\
  .drop_duplicates('Category', keep='first')

   Category   Color        Date
2  triangle    blue  2017-02-10
1    square  yellow  2017-11-10
3    circle  yellow  2017-07-10
cs95
  • 379,657
  • 97
  • 704
  • 746
3

The following should give you the desired output; compare to what you posted I first sort the values according to the date as you want to keep the earliest date per category:

df.sort_values('Date').groupby('Category').first().reset_index()

That gives the desired output:

   Category   Color        Date
0    circle  yellow  2017-07-10
1    square  yellow  2017-11-10
2  triangle    blue  2017-02-10

EDIT

Thanks to @Wen in the comments, one can make this call also more efficient by doing:

df.sort_values('Date').groupby('Category', as_index=False).first()

which also gives

   Category   Color        Date
0    circle  yellow  2017-07-10
1    square  yellow  2017-11-10
2  triangle    blue  2017-02-10
Cleb
  • 25,102
  • 20
  • 116
  • 151
3

head will return you original columns

df.sort_values(['Date']).groupby('Category').head(1)
Out[156]: 
   Category   Color        Date
2  triangle    blue  2017-02-10
3    circle  yellow  2017-07-10
1    square  yellow  2017-11-10

nth as well:

df.sort_values(['Date']).groupby('Category',as_index=False).nth(0)
Out[158]: 
   Category   Color        Date
2  triangle    blue  2017-02-10
3    circle  yellow  2017-07-10
1    square  yellow  2017-11-10

Or idxmin

df.loc[df.groupby('Category').Date.idxmin()]
Out[166]: 
   Category   Color       Date
3    circle  yellow 2017-07-10
1    square  yellow 2017-11-10
2  triangle    blue 2017-02-10
BENY
  • 317,841
  • 20
  • 164
  • 234