
I'm trying to get the latest occurrence of each ID Name, and I want to delete every other occurrence that happened before that time.

    ID Name     Comment        Time
0     W12D0       Fine     12:17:37
1     W12D0     Random     12:20:10
2     W12D0       What     12:21:06
3     W12D4       Fine     08:20:14
4     W12D5     Random     10:11:12
5     W12D5       Fine     11:37:02
..      ...        ...         ....

For example (according to the data above), the ID Name 'W12D0' is associated with 3 occurrences: 12:17:37, 12:20:10, 12:21:06

I want to keep only the row associated with that ID Name's latest time (in this case, 12:21:06). Every other row with W12D0 will be deleted. Essentially, I want something like this:

    ID Name     Comment        Time
0     W12D0       What     12:21:06
1     W12D4       Fine     08:20:14
2     W12D5       Fine     11:37:02
..      ...        ...         ....

How would I go about doing this?

anonymous

4 Answers


A more recent version (as of Aug 2018) of this would look like the following:

df = df.sort_values(by=['ID Name', 'Time'])
df = df.drop_duplicates(subset=['ID Name'], keep='last')

Just in case anyone comes across this and struggles to get the older answers to work.
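For anyone who wants to try this end to end, here is a self-contained version using the sample data from the question (the DataFrame construction is mine, not the asker's):

```python
import pandas as pd

# Rebuild the question's sample data (values copied from the question)
df = pd.DataFrame({
    "ID Name": ["W12D0", "W12D0", "W12D0", "W12D4", "W12D5", "W12D5"],
    "Comment": ["Fine", "Random", "What", "Fine", "Random", "Fine"],
    "Time": ["12:17:37", "12:20:10", "12:21:06", "08:20:14", "10:11:12", "11:37:02"],
})

# Ascending sort puts each ID Name's latest time last, so keep="last"
# retains exactly the row we want for every ID
df = df.sort_values(by=["ID Name", "Time"])
df = df.drop_duplicates(subset=["ID Name"], keep="last").reset_index(drop=True)
print(df)
```

Note that sorting the `Time` strings lexicographically works here only because they are zero-padded `HH:MM:SS` values; for anything messier, convert with `pd.to_datetime` first.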

daz-fuller

If "df" is your dataframe and "ID Name" the column, then you can try the following.

First you sort:

df = df.sort(columns=["ID Name", "Time"])

Then drop duplicates, keeping the last (latest) row for each ID:

df = df.drop_duplicates(subset=["ID Name"], take_last=True)

You can read more here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

Skorpeo

Quick and dirty but should probably do the job!

import pandas as pd

# load your data

data['Time'] = pd.to_datetime(data['Time'])

# idxmax gives the index label of the row with the latest Time in each group
list_of_indexes = [g['Time'].idxmax() for l, g in data.groupby('ID Name')]
data.loc[list_of_indexes]
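Here is a runnable sketch of that group-by approach on the question's sample data (the DataFrame construction is assumed, since the original just says "load your data"):

```python
import pandas as pd

# Stand-in for "load your data": the question's sample rows
data = pd.DataFrame({
    "ID Name": ["W12D0", "W12D0", "W12D0", "W12D4", "W12D5", "W12D5"],
    "Comment": ["Fine", "Random", "What", "Fine", "Random", "Fine"],
    "Time": ["12:17:37", "12:20:10", "12:21:06", "08:20:14", "10:11:12", "11:37:02"],
})
data["Time"] = pd.to_datetime(data["Time"])

# idxmax returns the index label of the latest Time within each group
latest = [g["Time"].idxmax() for _, g in data.groupby("ID Name")]
result = data.loc[latest].reset_index(drop=True)
print(result)
```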

This should do what you want. I usually load data from CSV format.

import pandas as pd

# load your data

df = pd.read_csv('data.csv')
df['Time'] = pd.to_datetime(df['Time'])

# remove duplicates: keep the row with the latest Time in each group

df = df.groupby('ID Name', group_keys=False).apply(lambda x: x.loc[x['Time'].idxmax()])

This link might also be helpful for you: remove duplicate

AmirC