1

Good day, I have this data set of customers with their region and Date of Birth (dob). Some customers have more than a region assigned, so I need to filter them out and keep only last region on pandas (using max value of string would be good). I tried to filter using group-by but I only manage to display customer and region, and I need to display the whole row (including dob)

Thanks in advance.

Data:

import numpy as np
import pandas as pd 
data = [['A', 'FL', '2000-06-01'], ['B', 'FL', '1999-05-01'], ['C', 'FL', '2000-02-03'], ['C', 'NY', '2000-02-03'], ['D', 'FL', '1990-03-01'], ['E', 'NY', '1999-12-07'], ['E', 'TX', '1999-12-07'], ['F', 'FL', '1995-06-07'], ['F', 'TX', '1995-06-07'],]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['customer', 'region', 'dob']) 

    customer    region  dob
0   A   FL  2000-06-01
1   B   FL  1999-05-01
2   C   FL  2000-02-03
3   C   NY  2000-02-03
4   D   FL  1990-03-01
5   E   NY  1999-12-07
6   E   TX  1999-12-07
7   F   FL  1995-06-07
8   F   TX  1995-06-07

My take:

df.groupby(['customer'], sort=False)['region'].max()

Partial Output

customer
A    FL
B    FL
C    NY
D    FL
E    TX
F    TX

Desired output:

    customer    region  dob
0   A   FL  2000-06-01
1   B   FL  1999-05-01
2   C   NY  2000-02-03
3   D   FL  1990-03-01
4   E   TX  1999-12-07
5   F   TX  1995-06-07
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57

3 Answers3

1

If all the values other than region are the same for each customer, you can use df.groupby('customer').max(); it's the ['region'] part that's restricting the columns to just region. (Also, you can use just use customer, rather than a list containing customer). Also, max returns the alphabetically last element. If you want the value from the last row, you'll need something different.

Acccumulation
  • 3,491
  • 1
  • 8
  • 12
1

I think what you need is the tail -> it returns the last row(s) for each grouping, in this case you need just the last (1) row per grouping:

df.groupby('customer').tail(1)
Out[279]: 
  customer region         dob
0        A     FL  2000-06-01
1        B     FL  1999-05-01
3        C     NY  2000-02-03
4        D     FL  1990-03-01
6        E     TX  1999-12-07
8        F     TX  1995-06-07
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
1

You can do a drop duplicates instead of grouping. Use this code

df.drop_duplicates(subset=['customer'], keep='last').reset_index(drop=True)

The reset index is just to re-number the index after duplicates are removed. The output will be as follows:

In [12]: df.drop_duplicates(subset=['customer'], keep='last').reset_index(drop=True)
Out[12]: 
  customer region         dob
0        A     FL  2000-06-01
1        B     FL  1999-05-01
2        C     NY  2000-02-03
3        D     FL  1990-03-01
4        E     TX  1999-12-07
5        F     TX  1995-06-07
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
aj7amigo
  • 368
  • 2
  • 12
  • 2
    Pro tip, as of `pandas 1.0.0` drop_duplicates has an `ignore_index` keyword argument which allows the index to be ignored without chaining a `reset_index` call. `df.drop_duplicates(subset=['customer'], keep='last', ignore_index=True)` Source: [What’s new in 1.0.0 (January 29, 2020)](https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.0.0.html#other-enhancements) – Henry Ecker Aug 29 '21 at 01:49
  • If you wish to sort the data using a variable other than the default and then obtain the row with the maximum value relative to another variable, the following code can be utilized. `df.sort_values(['region'],ascending=True).drop_duplicates(subset= ['customer'], keep='last').reset_index(drop=True)` – Surender Singh Mar 14 '23 at 11:50