
I am looking for a pandas equivalent to the SQL Redshift window function LAST_VALUE().

I have a pandas DataFrame of serial-number reports to which new rows are appended daily.

import pandas as pd

data = {'serial_num': [123456, 678901, 123456, 678901],
        'status': ['Good', 'Good', 'BAD', 'BAD'],
        'last_check': ['2020-03-02', '2020-03-02', '2020-03-01', '2020-03-01']}
new_br = pd.DataFrame.from_dict(data)
new_br = pd.DataFrame.from_dict(data)


new_br

serial_num status last_check
123456     Good   2020-03-02
678901     Good   2020-03-02
123456     BAD    2020-03-01
678901     BAD    2020-03-01

I want the max value of last_check for each serial_num, while preserving all the other columns (my actual dataset has more columns).

My code so far is:

new_br.set_index('last_check').groupby('serial_num').max()

serial_num  status
123456      Good
678901      Good

However, this removes the last_check column. How can I preserve the date column, similar to the LAST_VALUE() function in SQL Redshift?

My expected output is:

serial_num  status last_check
123456      Good    2020-03-02
678901      Good    2020-03-02

  • Kindly post your expected output in dataframe format. – sammywemmy Mar 12 '20 at 19:36
  • Try `new_br.loc[new_br.groupby('serial_num')['last_check'].idxmax()]` – Chris Adams Mar 12 '20 at 19:41
  • I think you'll find all of the possible methods in https://stackoverflow.com/questions/54717473/python-3-pandas-groupby-filter. You can sort on the date and then use groupby + tail, or drop duplicates. – ALollz Mar 12 '20 at 19:43
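
For completeness, here is a minimal sketch of the sort-then-deduplicate alternative mentioned in the last comment above, using the question's own example data. Note that 'YYYY-MM-DD' strings happen to sort correctly even before conversion to datetime, but converting first is safer in general:

import pandas as pd

data = {'serial_num': [123456, 678901, 123456, 678901],
        'status': ['Good', 'Good', 'BAD', 'BAD'],
        'last_check': ['2020-03-02', '2020-03-02', '2020-03-01', '2020-03-01']}
new_br = pd.DataFrame.from_dict(data)

# Sort so the most recent check per serial ends up last, then keep that row
latest = (new_br
          .sort_values('last_check')
          .drop_duplicates('serial_num', keep='last'))
print(latest)
#    serial_num status  last_check
# 0      123456   Good  2020-03-02
# 1      678901   Good  2020-03-02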

1 Answer


Use `groupby.idxmax` with `loc`:

import pandas as pd

data = {'serial_num': [123456, 678901, 123456, 678901],
        'status': ['Good', 'Good', 'BAD', 'BAD'],
        'last_check': ['2020-03-02', '2020-03-02', '2020-03-01', '2020-03-01']}
new_br = pd.DataFrame.from_dict(data)

print(new_br.dtypes)

# serial_num     int64
# status        object
# last_check    object
# dtype: object

# if last_check is not already datetime dtype, convert it first
new_br['last_check'] = pd.to_datetime(new_br['last_check'])

new_br.loc[new_br.groupby('serial_num')['last_check'].idxmax()]

[out]

   serial_num status last_check
0      123456   Good 2020-03-02
1      678901   Good 2020-03-02
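
For clarity on why this works: `groupby(...)['last_check'].idxmax()` returns, for each group, the index label of the row holding the maximum `last_check`; `loc` then pulls back the complete original rows, which is why all columns survive. The intermediate result for the example data:

print(new_br.groupby('serial_num')['last_check'].idxmax())

# serial_num
# 123456    0
# 678901    1
# Name: last_check, dtype: int64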
  • I received this error when trying this out on my sample code: AttributeError: 'SeriesGroupBy' object has no attribute '_aggregate_item_by_item' – Jacky Mar 12 '20 at 19:47
  • `import pandas as pd; data = {'serial_num': [123456, 678901, 123456, 678901], 'status': ['Good', 'Good', 'BAD', 'BAD'], 'last_check': ['2020-03-02','2020-03-02','2020-03-01','2020-03-01']}; new_br = pd.DataFrame.from_dict(data); new_br.loc[new_br.groupby('serial_num')['last_check'].idxmax()]` – Jacky Mar 12 '20 at 19:51
  • You need to convert the `last_check` column to datetime first. – kennyvh Mar 12 '20 at 19:52
  • You forgot `new_br['last_check'] = pd.to_datetime(new_br['last_check'])`. – Chris Adams Mar 12 '20 at 19:53
  • @ChrisA I realized that this answer coerces serial_num to an int. serial_num in the actual dataset is an object. Is there a way to preserve serial_num dtype? – Jacky Mar 12 '20 at 20:29
  • It's not coercing. `serial_num` is an int in the example data you've supplied (check for yourself with `print(new_br.dtypes)`)... All this is doing is using `loc` to filter your original dataframe by index label; no dtypes are coerced to anything else. But if you mean how to fix your sample data, just put quotes around the `serial_num` values – `'serial_num': ['123456', '678901', '123456', '678901']`. – Chris Adams Mar 12 '20 at 21:01
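
To illustrate the dtype point from the last comment, a minimal sketch using a hypothetical variant of the question's data in which serial_num is defined as strings; the `loc`-based filter leaves the object dtype intact:

import pandas as pd

# same data as the question, but serial_num stored as strings (object dtype)
data = {'serial_num': ['123456', '678901', '123456', '678901'],
        'status': ['Good', 'Good', 'BAD', 'BAD'],
        'last_check': ['2020-03-02', '2020-03-02', '2020-03-01', '2020-03-01']}
new_br = pd.DataFrame.from_dict(data)
new_br['last_check'] = pd.to_datetime(new_br['last_check'])

result = new_br.loc[new_br.groupby('serial_num')['last_check'].idxmax()]
print(result.dtypes)
# serial_num            object
# status                object
# last_check    datetime64[ns]
# dtype: object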