
Is the drop function wrong? I need expert guidance on how to fix this:

  • Rows for a certain date may appear in multiple files
  • I remove rows with duplicate date,time from cs...
  • ...then compare a column (val2), only keeping rows where the first row of val2 is highest

Code:

cs = pd.concat([pd.read_csv(f) for f in fnames])
dp = cs[cs.duplicated(['date','time'],keep=False)]
dp = dp.sort_values(['date','time'],ascending=True)

i=0
while len(dp)>0:
    if dp.values[i][3] > dp.values[i+1][3]:
        if dp.index[i] > dp.index[i+1]:
            cs.drop(cs[(cs.date==dp.values[i][0]) & (cs.index < dp.index[i])].index, inplace=True)
            dp = cs[cs.duplicated(['date','time'],keep=False)]
            dp = dp.sort_values(['date','time'],ascending=True)

Sample data:

file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230

Expected output:

date,time,val1,val2
20jun,01:00,10,320
20jun,02:00,10,50
21jun,01:00,10,130
21jun,02:00,10,230

Actual output:

date,time,val1,val2
20jun,01:00,10,320
20jun,02:00,10,50
21jun,01:00,10,130
spx
  • if you have numerics with filename this should work: `df[df.file.eq(df.groupby('date')['file'].transform('max'))]` – anky Jun 23 '19 at 17:03 (see the sketch just after these comments)
  • As @michcio said, anytime you find yourself manually constructing a dataframe with something like `cs[cs.duplicated(...)]`, that's a code smell you should be using `groupby()`. And then the `sort_values` on the group contents can be done with an aggregate like `.agg({'val': 'max'})` – smci Jun 23 '19 at 17:12
  • There are [274 pandas questions on dropping duplicates](https://stackoverflow.com/search?q=pandas+drop+duplicates+is%3Aq), surely this is a duplicate, of which? – smci Jun 23 '19 at 17:19
  • Related: [pandas: drop duplicates in groupby 'date'](https://stackoverflow.com/questions/37105609/pandas-drop-duplicates-in-groupby-date) – smci Jun 23 '19 at 17:32
  • Your question boils down to: "drop exact duplicates of `['date','time']` columns, like [this](https://stackoverflow.com/questions/37105609/pandas-drop-duplicates-in-groupby-date); then do a compound `groupby(['date','time'])`; summarize/aggregate by max value on `val2` column"* – smci Jun 23 '19 at 17:34
  • Hi all, thanks for the comments. The dataset for a date can appear in two files, and I only need to use it from either file1 or file2. Which file to extract from is determined by column val2 of the FIRST row of that date in both files: if val2 of the first row of date X in file1 is greater than val2 of the first row of the same date X in file2, then I will use the dataset of date X from file1, and vice versa. I will try to update my sample data. – spx Jun 24 '19 at 12:22
  • @anky_91, smci, I think my case is a little bit different, I cannot use the groupby. But I may be wrong though. Please enlighten me. Thanks. – spx Jun 24 '19 at 12:30
  • @spx Currently your question is still unclear. E.g. "keeping rows where the first row of val2 is highest" doesn't make much sense. I think I know what you're trying to achieve, but I'd like to be sure. Please edit your question. Try not to suggest the solution (as this is what we are looking for), but rather formulate the problem unambiguously. – michcio1234 Jun 25 '19 at 16:42
  • @michcio1234 Actually I was processing data for futures (like shares/stocks). Futures can have multiple contract periods, say June/July/Aug/etc. I just want to 'harvest' the data of the most transacted one. If June is going to expire, its volume (val2 in my sample data) will decline and July's volume will be greater than June's. Instead of manually checking or hardcoding the date, I use the script to extract only the data with the most volume (usually you can tell from the opening time => the first row of volume is highest between June and July). Hope it clarifies. Thank you. – spx Jun 27 '19 at 12:46
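
A quick check of anky's one-liner from the comments above against the sample data. This is only a sketch: it keeps, for each date, the rows whose file name sorts last, which happens to coincide with the expected output here but is not literally the "file whose first val2 is highest" rule:

import pandas as pd
from io import StringIO

sample = StringIO("""file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230""")
df = pd.read_csv(sample)

# per date, keep only the rows belonging to the "largest" file name
result = df[df.file.eq(df.groupby('date')['file'].transform('max'))]
print(result)
#   file   date   time  val1  val2
# 2   f2  20jun  01:00    10   320
# 3   f2  20jun  02:00    10    50
# 4   f2  21jun  01:00    10   130
# 5   f2  21jun  02:00    10   230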

3 Answers


EDITED ANSWER:

After discussion in the comments I think this is what you need (I added some code for reproducing the problem):

import pandas as pd
from io import StringIO

input_string = """file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230"""

buf = StringIO(input_string)
cs = pd.read_csv(buf)

def pick_file(df):
    # first row of each file within this date group
    first = df.groupby('file').first()
    # file whose first row has the highest val2
    file = first['val2'].idxmax()
    # keep only the rows that come from that file
    return df[df['file'] == file]

# for each date, keep the rows of the "winning" file
result = cs.groupby(['date']).apply(pick_file)

# drop the extra 'date' index level added by groupby
result = result.reset_index(level=0, drop=True)

The result is:

  file   date   time  val1  val2
2   f2  20jun  01:00    10   320
3   f2  20jun  02:00    10    50
4   f2  21jun  01:00    10   130
5   f2  21jun  02:00    10   230

This does some inception: a groupby inside a groupby.

Outer groupby groups by date because these are the groups we want to search for the file in.

Inner groupby searches for the correct file within a group and keeps only the rows from this file.
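
To make the inner step concrete, here is what pick_file sees for the 20jun group of the sample data (reusing cs from the snippet above; the printed frames are just illustrative):

group = cs[cs['date'] == '20jun']
first = group.groupby('file').first()   # first row of each file in this group
print(first)
#        date   time  val1  val2
# file
# f1    20jun  01:00    10   210
# f2    20jun  01:00    10   320
print(first['val2'].idxmax())           # 'f2' -> only f2's rows survive for 20jun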


ORIGINAL ANSWER:

Instead of constructing a dataframe with duplicates and iterating through it, you can just use groupby:

cs = pd.concat([pd.read_csv(f) for f in fnames])
result = cs.groupby(['date', 'time'])\
    .apply(lambda x: x[x['val2']==x['val2'].max()])

It groups all the rows which have the same values in date and time columns and then, for each group, it leaves only the row with the highest value of val2.

The result is:

              file   date   time  val1  val2
date  time                                  
20jun 01:00 2   f2  20jun  01:00    10   320
      02:00 3   f2  20jun  02:00    10   220
21jun 01:00 4   f2  21jun  01:00    10   130
      02:00 5   f2  21jun  02:00    10   230
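
If the extra date/time levels in the resulting index get in the way, a small follow-up sketch to drop them afterwards:

result = result.reset_index(level=['date', 'time'], drop=True)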
michcio1234
  • Hi @michcio1234, thanks for the advice. But I think I cannot use groupby. I updated my sample data above. I think I need to use iteration. But I'm not an expert; any workable advice is highly appreciated. – spx Jun 24 '19 at 12:36

To delete the rows you want:

  • with repeating values of date and time,
  • keeping the last row from the repeated set,

use:

cs.drop_duplicates(subset=['date', 'time'], keep='last', inplace=True)

There is no need for an "initial sort" of the source rows.

Edit

Since you wrote that, from each set of repeating rows, you want to keep the row with the highest val2:

Add ignore_index=True to pd.concat. This way you will have the "ordered" index, needed to restore the initial row order (in the last step).
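
That is (a sketch, assuming fnames as in the question):

cs = pd.concat([pd.read_csv(f) for f in fnames], ignore_index=True)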

Then sort rows:

cs.sort_values(['date','time','val2'], inplace=True)

so that in any repeating group (by date and time), the row with the highest val2 is at the last place.

The third step is:

cs.drop_duplicates(subset=['date', 'time'], keep='last', inplace=True)

as in my first proposal.

And for the final step, to restore the original order of the rows, sort them once more, this time by the index (in place):

cs.sort_index(inplace=True)
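
Putting the steps together (a minimal sketch, assuming fnames as in the question; as the comment below points out, this keeps the row with the highest val2 for each date/time pair rather than selecting one whole file per date):

cs = pd.concat([pd.read_csv(f) for f in fnames], ignore_index=True)
cs.sort_values(['date', 'time', 'val2'], inplace=True)
cs.drop_duplicates(subset=['date', 'time'], keep='last', inplace=True)
cs.sort_index(inplace=True)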
Valdi_Bo
  • Hi @Valdi_Bo, thanks for the advice. But I think that won't give my expected result. I have updated my sample case/data above. Sorry I didn't provide it clearly earlier. – spx Jun 24 '19 at 12:52

The 'invalid' rows got deleted because of the index. After pd.concat, I need to reset the index using cs.reset_index(inplace=True,drop=True). Without resetting the index, the index starts from 0 for each file, so valid rows got deleted by the drop function because some of the index values were repeated.

Though I can now get the correct result, the filtering by the date column is still not actually working ( cs.drop(cs[(cs.date==dp.values[0][0]) ). If it worked, I wouldn't have had to reset the index. Or did I use it wrongly?

Thanks all for your help. If you have a better and more elegant way to get the expected output, it would be appreciated.

Best regards.

cs = pd.concat([pd.read_csv(f) for f in fnames])
# reset the index so every row gets a unique index across all the files
cs.reset_index(inplace=True, drop=True)
# rows whose (date, time) combination occurs more than once
dp = cs[cs.duplicated(['date','time'], keep=False)]
dp = dp.sort_values(['date','time'], ascending=True)

while len(dp) > 0:
    # compare the first two duplicate rows
    if dp.values[0][3] > dp.values[1][3]:
        if dp.index[0] > dp.index[1]:
            # drop the rows of that date that come before the current duplicate...
            cs.drop(cs[(cs.date == dp.values[0][0]) & (cs.index < dp.index[0])].index, inplace=True)
            # ...and remove that date from the duplicates frame
            dp.drop(dp[(dp.date == dp.values[0][0])].index, inplace=True)
spx