Filter data using max. categorical value of a group in pandas

Question

Good day, I have this data set of customers with their region and Date of Birth (dob). Some customers have more than a region assigned, so I need to filter them out and keep only last region on pandas (using max value of string would be good). I tried to filter using group-by but I only manage to display customer and region, and I need to display the whole row (including dob)

Thanks in advance.

Data:

import numpy as np
import pandas as pd 
data = [['A', 'FL', '2000-06-01'], ['B', 'FL', '1999-05-01'], ['C', 'FL', '2000-02-03'], ['C', 'NY', '2000-02-03'], ['D', 'FL', '1990-03-01'], ['E', 'NY', '1999-12-07'], ['E', 'TX', '1999-12-07'], ['F', 'FL', '1995-06-07'], ['F', 'TX', '1995-06-07'],]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['customer', 'region', 'dob'])

    customer    region  dob
0   A   FL  2000-06-01
1   B   FL  1999-05-01
2   C   FL  2000-02-03
3   C   NY  2000-02-03
4   D   FL  1990-03-01
5   E   NY  1999-12-07
6   E   TX  1999-12-07
7   F   FL  1995-06-07
8   F   TX  1995-06-07

My take:

df.groupby(['customer'], sort=False)['region'].max()

Partial Output

customer
A    FL
B    FL
C    NY
D    FL
E    TX
F    TX

Desired output:

    customer    region  dob
0   A   FL  2000-06-01
1   B   FL  1999-05-01
2   C   NY  2000-02-03
3   D   FL  1990-03-01
4   E   TX  1999-12-07
5   F   TX  1995-06-07

Potential duplicate in [python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B](https://stackoverflow.com/q/12497402/15497888) — Henry Ecker, Aug 29 '21 at 01:55

score 1 · Accepted Answer · answered Aug 29 '21 at 00:52

If all the values other than region are the same for each customer, you can use df.groupby('customer').max(); it's the ['region'] part that's restricting the columns to just region. (Also, you can use just use customer, rather than a list containing customer). Also, max returns the alphabetically last element. If you want the value from the last row, you'll need something different.

score 1 · Answer 2 · answered Aug 29 '21 at 00:53

I think what you need is the tail -> it returns the last row(s) for each grouping, in this case you need just the last (1) row per grouping:

df.groupby('customer').tail(1)
Out[279]: 
  customer region         dob
0        A     FL  2000-06-01
1        B     FL  1999-05-01
3        C     NY  2000-02-03
4        D     FL  1990-03-01
6        E     TX  1999-12-07
8        F     TX  1995-06-07

score 1 · Answer 3 · edited Aug 29 '21 at 01:53

1

You can do a drop duplicates instead of grouping. Use this code

df.drop_duplicates(subset=['customer'], keep='last').reset_index(drop=True)

The reset index is just to re-number the index after duplicates are removed. The output will be as follows:

In [12]: df.drop_duplicates(subset=['customer'], keep='last').reset_index(drop=True)
Out[12]: 
  customer region         dob
0        A     FL  2000-06-01
1        B     FL  1999-05-01
2        C     NY  2000-02-03
3        D     FL  1990-03-01
4        E     TX  1999-12-07
5        F     TX  1995-06-07

edited Aug 29 '21 at 01:53

Henry Ecker

34,399
18
41
57

answered Aug 29 '21 at 01:00

aj7amigo

368
2
12

2

Pro tip, as of `pandas 1.0.0` drop_duplicates has an `ignore_index` keyword argument which allows the index to be ignored without chaining a `reset_index` call. `df.drop_duplicates(subset=['customer'], keep='last', ignore_index=True)` Source: [What’s new in 1.0.0 (January 29, 2020)](https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.0.0.html#other-enhancements) – Henry Ecker Aug 29 '21 at 01:49
If you wish to sort the data using a variable other than the default and then obtain the row with the maximum value relative to another variable, the following code can be utilized. `df.sort_values(['region'],ascending=True).drop_duplicates(subset= ['customer'], keep='last').reset_index(drop=True)` – Surender Singh Mar 14 '23 at 11:50

Filter data using max. categorical value of a group in pandas

3 Answers3