
I have a data frame df:

       first_seen              last_seen             uri
0   2015-05-11 23:08:46     2015-05-11 23:08:50 http://11i-ssaintandder.com/
1   2015-05-11 23:08:46     2015-05-11 23:08:46 http://11i-ssaintandder.com/
2   2015-05-02 18:27:10     2015-06-06 03:52:03 http://goo.gl/NMqjd1
3   2015-05-02 18:27:10     2015-06-08 08:44:53 http://goo.gl/NMqjd1
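
For reproducibility, the frame can be built like this (assuming both timestamp columns are parsed as datetimes):

import pandas as pd

# values copied from the display above
df = pd.DataFrame({
    'first_seen': pd.to_datetime(['2015-05-11 23:08:46', '2015-05-11 23:08:46',
                                  '2015-05-02 18:27:10', '2015-05-02 18:27:10']),
    'last_seen':  pd.to_datetime(['2015-05-11 23:08:50', '2015-05-11 23:08:46',
                                  '2015-06-06 03:52:03', '2015-06-08 08:44:53']),
    'uri': ['http://11i-ssaintandder.com/', 'http://11i-ssaintandder.com/',
            'http://goo.gl/NMqjd1', 'http://goo.gl/NMqjd1'],
})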

I would like to remove the rows that have the same "first_seen" and "uri", keeping only the row with the latest "last_seen".

Here is an example of the expected dataset:

       first_seen              last_seen             uri
0   2015-05-11 23:08:46     2015-05-11 23:08:50 http://11i-ssaintandder.com/
3   2015-05-02 18:27:10     2015-06-08 08:44:53 http://goo.gl/NMqjd1

Does anybody know how to do it without writing a for loop?

UserYmY

1 Answer


Call drop_duplicates, pass the columns you want to consider for duplicate matching as the subset argument, and set the param take_last=True:

In [295]:

df.drop_duplicates(subset=['first_seen','uri'], take_last=True)
Out[295]:
  index          first_seen            last_seen                           uri
1     1 2015-05-11 23:08:46  2015-05-11 23:08:46  http://11i-ssaintandder.com/
3     3 2015-05-02 18:27:10  2015-06-08 08:44:53          http://goo.gl/NMqjd1
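
Note that take_last was later deprecated in favour of keep; on a current pandas build the equivalent call would be roughly:

# keep='last' replaces take_last=True on newer pandas versions
df.drop_duplicates(subset=['first_seen', 'uri'], keep='last')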

EDIT

In order to keep the row with the latest date you need to sort the df first on 'first_seen' and 'last_seen':

In [317]:
df = df.sort(columns=['first_seen','last_seen'], ascending=[0,1])
df.drop_duplicates(subset=['first_seen','uri'], take_last=True)

Out[317]:
  index          first_seen            last_seen                           uri
0     0 2015-05-11 23:08:46  2015-05-11 23:08:50  http://11i-ssaintandder.com/
3     3 2015-05-02 18:27:10  2015-06-08 08:44:53          http://goo.gl/NMqjd1
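
On newer pandas releases, where df.sort and take_last have been removed, the same idea can be written with sort_values and keep='last' (a sketch under that assumption):

# sort so the latest last_seen comes last within each (first_seen, uri) pair,
# then keep the last occurrence of each duplicate pair
df = df.sort_values(by=['first_seen', 'last_seen'], ascending=[False, True])
df.drop_duplicates(subset=['first_seen', 'uri'], keep='last')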
EdChum
  • @firelynx good point, will update, but wouldn't you need to sort by first_seen and last_seen? – EdChum Jun 26 '15 at 14:54
  • I don't see why you would, but I would not trust myself too much with this headache I have. – firelynx Jun 26 '15 at 14:58
  • @firelynx I think it would be safer, as you could have overlapping dates in last_seen compared with first_seen; it seems to satisfy the OP's requirement now – EdChum Jun 26 '15 at 14:59
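
An alternative that avoids worrying about sort order is to pick, per ('first_seen', 'uri') pair, the row whose last_seen is the maximum; a sketch, assuming last_seen is a datetime (or otherwise orderable) column:

# index labels of the row with the latest last_seen within each group
latest = df.groupby(['first_seen', 'uri'])['last_seen'].idxmax()
df.loc[latest]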