
I'm trying to get the latest occurrence of each ID Name, and I want to delete every other occurrence that happened before that time.

    ID Name     Comment        Time
0     W12D0       Fine     12:17:37
1     W12D0     Random     12:20:10
2     W12D0       What     12:21:06
3     W12D4       Fine     08:20:14
4     W12D5     Random     10:11:12
5     W12D5       Fine     11:37:02
..      ...        ...         ....

For example (according to the data above), the ID Name 'W12D0' is associated with 3 occurrences: 12:17:37, 12:20:10, 12:21:06

I want to keep only the row associated with that ID Name's latest time (in this case, 12:21:06). Every other row with W12D0 will be deleted. Essentially, I want something like this:

    ID Name     Comment        Time
0     W12D0       What     12:21:06
1     W12D4       Fine     08:20:14
2     W12D5       Fine     11:37:02
..      ...        ...         ....

How would I go about doing this?

anonymous

4 Answers


A more recent version (as of Aug 2018) of this would look like the following:

df = df.sort_values(by=['ID Name', 'Time'])
df = df.drop_duplicates(subset=['ID Name'], keep='last')

Just in case anyone comes across this and struggles to get the older answers to work.
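For anyone who wants to try this end to end, here is a self-contained version using the sample data from the question (the DataFrame construction is mine, not the asker's):

```python
import pandas as pd

# Rebuild the question's sample data (values copied from the question)
df = pd.DataFrame({
    "ID Name": ["W12D0", "W12D0", "W12D0", "W12D4", "W12D5", "W12D5"],
    "Comment": ["Fine", "Random", "What", "Fine", "Random", "Fine"],
    "Time": ["12:17:37", "12:20:10", "12:21:06", "08:20:14", "10:11:12", "11:37:02"],
})

# Ascending sort puts each ID Name's latest time last, so keep="last"
# retains exactly the row we want for every ID
df = df.sort_values(by=["ID Name", "Time"])
df = df.drop_duplicates(subset=["ID Name"], keep="last").reset_index(drop=True)
print(df)
```

Note that sorting the `Time` strings lexicographically works here only because they are zero-padded `HH:MM:SS` values; for anything messier, convert with `pd.to_datetime` first.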

daz-fuller

If "df" is your dataframe and "ID Name" the column, then you can try the following.

First you sort:

df = df.sort(columns=["ID Name", "Time"])

Then drop duplicates, keeping the last (latest) row for each ID:

df = df.drop_duplicates(subset=["ID Name"], take_last=True)

You can read more here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

Skorpeo

Quick and dirty but should probably do the job!

import pandas as pd

# load your data

data['Time'] = pd.to_datetime(data['Time'])

# idxmax gives the index label of the row with the latest Time in each group
list_of_indexes = [g['Time'].idxmax() for l, g in data.groupby('ID Name')]
data.loc[list_of_indexes]
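Here is a runnable sketch of that group-by approach on the question's sample data (the DataFrame construction is assumed, since the original just says "load your data"):

```python
import pandas as pd

# Stand-in for "load your data": the question's sample rows
data = pd.DataFrame({
    "ID Name": ["W12D0", "W12D0", "W12D0", "W12D4", "W12D5", "W12D5"],
    "Comment": ["Fine", "Random", "What", "Fine", "Random", "Fine"],
    "Time": ["12:17:37", "12:20:10", "12:21:06", "08:20:14", "10:11:12", "11:37:02"],
})
data["Time"] = pd.to_datetime(data["Time"])

# idxmax returns the index label of the latest Time within each group
latest = [g["Time"].idxmax() for _, g in data.groupby("ID Name")]
result = data.loc[latest].reset_index(drop=True)
print(result)
```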

This should do what you want. I usually load data from CSV format.

import pandas as pd

# load your data

df = pd.read_csv('data.csv')
df['Time'] = pd.to_datetime(df['Time'])

# remove duplicates: keep the row with the latest Time in each group

df = df.groupby('ID Name', group_keys=False).apply(lambda x: x.loc[x['Time'].idxmax()])

This link might also be helpful for you: remove duplicate

AmirC