Is my `drop` call wrong? I need expert guidance on how to fix this:

- Rows for a given date may appear in multiple files.
- I find the rows with duplicate `date,time` in `cs`...
- ...then compare a column (`val2`), keeping only the rows that belong to the duplicate whose first `val2` value is the highest (see the sketch right after this list for the rule spelled out).
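To make the rule concrete, here is a minimal sketch of the selection I am after, written with `groupby` instead of my loop. It is only a sketch: it assumes I tag each row with a `file` column while reading (the raw CSVs do not contain one), and `fnames` is the same list of file names as in my code below.

```python
import pandas as pd

frames = []
for name in fnames:
    df = pd.read_csv(name)
    df["file"] = name                     # remember which file a row came from
    frames.append(df)
cs = pd.concat(frames, ignore_index=True).sort_values(["date", "time"])

# Earliest-time row of every (date, file) group.
firsts = cs.groupby(["date", "file"], as_index=False).first()

# Per date, the file whose first row has the highest val2 "wins".
winners = firsts.loc[firsts.groupby("date")["val2"].idxmax(), ["date", "file"]]

# Keep only the winning file's rows for each date, drop the helper column.
result = (cs.merge(winners, on=["date", "file"])
            .sort_values(["date", "time"])[["date", "time", "val1", "val2"]])
```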
Code:
import pandas as pd

cs = pd.concat([pd.read_csv(f) for f in fnames])

# every row whose (date, time) occurs more than once, ordered by date/time
dp = cs[cs.duplicated(['date', 'time'], keep=False)]
dp = dp.sort_values(['date', 'time'], ascending=True)

i = 0
while len(dp) > 0:
    # column position 3 (val2): compare the two rows of the current duplicate pair
    if dp.values[i][3] > dp.values[i + 1][3]:
        if dp.index[i] > dp.index[i + 1]:
            # drop every row of that date whose index label is smaller than the current row's
            cs.drop(cs[(cs.date == dp.values[i][0]) & (cs.index < dp.index[i])].index, inplace=True)
    # recompute the duplicates after the drop
    dp = cs[cs.duplicated(['date', 'time'], keep=False)]
    dp = dp.sort_values(['date', 'time'], ascending=True)
Sample data:
file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230
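The `f1`/`f2` markers above only show which file a row came from; as far as I can tell from my own code (column 0 is `date`, column 3 is `val2`), the real CSVs hold just `date,time,val1,val2`. If you want to reproduce this without creating files, the two samples can be rebuilt in memory:

```python
import io
import pandas as pd

f1 = io.StringIO("date,time,val1,val2\n"
                 "20jun,01:00,10,210\n"
                 "20jun,02:00,10,110\n")
f2 = io.StringIO("date,time,val1,val2\n"
                 "20jun,01:00,10,320\n"
                 "20jun,02:00,10,50\n"
                 "21jun,01:00,10,130\n"
                 "21jun,02:00,10,230\n")
cs = pd.concat([pd.read_csv(f) for f in (f1, f2)])
```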
Expected output:
date,time,val1,val2
20jun,01:00,10,320
20jun,02:00,10,50
21jun,01:00,10,130
21jun,02:00,10,230
Actual output:
date,time,val1,val2
20jun,01:00,10,320
20jun,02:00,10,50
21jun,01:00,10,130
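For completeness, the expected frame written out as a quick check (names are mine, purely for testing; `result` is the frame produced by the groupby sketch above):

```python
expected = pd.DataFrame({
    "date": ["20jun", "20jun", "21jun", "21jun"],
    "time": ["01:00", "02:00", "01:00", "02:00"],
    "val1": [10, 10, 10, 10],
    "val2": [320, 50, 130, 230],
})
print(result.reset_index(drop=True).equals(expected))   # should print True
```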