
For a time series analysis, I have to drop instances that occur on the same date, but keep some of the 'deleted' information and add it to the remaining 'duplicate' instance. Below is a short example of part of my dataset.

import pandas as pd
import numpy as np

z = pd.DataFrame({'lat': [49.125, 49.125],
                  'lon': [-114.125, -114.125],
                  'time': [np.datetime64('2005-08-09'), np.datetime64('2005-08-09')],
                  'duration': [3, 6],
                  'size': [4, 10]})

    lat     lon         time        duration    size
0   49.125  -114.125    2005-08-09  3           4
1   49.125  -114.125    2005-08-09  6           10

I would like to drop the (duplicate) instance which has the lowest 'duration' value but at the same time sum the 'size' variables. Output would look like:

    lat     lon         time        duration    size
0   49.125  -114.125    2005-08-09  6           14

Does anyone know how I would be able to tackle such a problem? Furthermore, for another variable, I would like to take the mean of these values; I expect the process would be similar to summing them.

Edit: so far I know how to keep the instance with the highest 'duration' value using:

z.sort_values(by='duration', ascending=False).drop_duplicates(subset=['lat', 'lon','time'], keep='last')
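For what it's worth, that drop_duplicates line can be combined with a groupby transform so the summed 'size' survives the drop; a sketch (transform is a standard pandas method, the combination is an assumption about what fits here):

```python
import pandas as pd
import numpy as np

z = pd.DataFrame({'lat': [49.125, 49.125],
                  'lon': [-114.125, -114.125],
                  'time': [np.datetime64('2005-08-09'), np.datetime64('2005-08-09')],
                  'duration': [3, 6],
                  'size': [4, 10]})

# Write the per-group sum of 'size' back onto every row first, then keep
# only the row with the highest 'duration', as in the line above
z['size'] = z.groupby(['lat', 'lon', 'time'])['size'].transform('sum')
res = (z.sort_values(by='duration', ascending=False)
        .drop_duplicates(subset=['lat', 'lon', 'time'], keep='first'))
```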

2 Answers


If those are all the columns in your dataframe, you can get your result using a groupby on your time column, passing in an aggregation for each column.

More specifically, you can drop the (duplicate) instance which has the lowest 'duration' by keeping the max() duration, and at the same time sum the 'size' variables by using sum() on your size column.

res = z.groupby('time').agg({'lat': 'first',
                             'lon': 'first',
                             'duration': 'max',
                             'size': 'sum'}).reset_index()

res

        time     lat      lon  duration  size
0 2005-08-09  49.125 -114.125         6    14

The only difference is that 'time' is now your first column, which you can quickly fix.
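That column-order fix can be as simple as reindexing with the original column list; a sketch:

```python
import pandas as pd
import numpy as np

z = pd.DataFrame({'lat': [49.125, 49.125],
                  'lon': [-114.125, -114.125],
                  'time': [np.datetime64('2005-08-09'), np.datetime64('2005-08-09')],
                  'duration': [3, 6],
                  'size': [4, 10]})

res = z.groupby('time').agg({'lat': 'first',
                             'lon': 'first',
                             'duration': 'max',
                             'size': 'sum'}).reset_index()

# Reindex with the original column order to move 'time' back into place
res = res[z.columns.tolist()]
```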

sophocles
  • Thank you for your answer! Really clean code. I will also delve into the functionality of 'agg' to better understand what you are doing! – John van de Ven Mar 31 '21 at 14:05
  • If you get multiple records at different lat and lon at the same time, will it work or would it only take the first location encountered? – politinsa Mar 31 '21 at 14:06
  • This is a good point, and no, it will only keep the first instance. So my solution would work if the dataframe posted is a complete representation of the real-life scenario. Note however that you can use aggregations like `mean` or `median` in case you have many values you want to summarise. – sophocles Mar 31 '21 at 14:09
  • So @politinsa made a good comment! My dataset consists of 1800 images (days) with a 46 by 42 grid. The example I provided was just one timepoint for a specific cell. Are you telling me that this doesn't work for my entire dataset? – John van de Ven Mar 31 '21 at 14:22
  • Long story short: if what you posted represents your dataset, i.e. for a single day you have the same 'lat' and 'lon', this will work perfectly. If you have different 'lat' and 'lon' per day, you will have to tweak the aggregation in the `agg` function and return what you want accordingly. I would suggest having a read about `pandas.groupby.aggregate`; it will clear things up :) – sophocles Mar 31 '21 at 14:30
  • Will do! I just tried to group by ['time', 'lat', 'lon'] and that does seem to do the trick as well. However, I do have to check if it is really the case. And I will definitely check out the function! Thanks! – John van de Ven Mar 31 '21 at 14:34
  • Exactly, you can add extra variables in your `groupby`. Try it out, and let me know if you want to discuss further. – sophocles Mar 31 '21 at 14:35
  • I added this method in my answer ;) Sort by duration first, since groupby preserves the order of the rows, and it's a one-liner. – politinsa Mar 31 '21 at 14:41
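To make the mean case from the comments concrete, a minimal sketch (assuming a hypothetical 'windspeed' column standing in for the variable to be averaged):

```python
import pandas as pd
import numpy as np

# Example frame with a hypothetical 'windspeed' column added for illustration
z = pd.DataFrame({'lat': [49.125, 49.125],
                  'lon': [-114.125, -114.125],
                  'time': [np.datetime64('2005-08-09'), np.datetime64('2005-08-09')],
                  'duration': [3, 6],
                  'size': [4, 10],
                  'windspeed': [12.0, 8.0]})

# One groupby can mix aggregations: max for duration, sum for size, mean for windspeed
res = z.groupby(['lat', 'lon', 'time'], as_index=False).agg(
    {'duration': 'max', 'size': 'sum', 'windspeed': 'mean'})
```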

Group by to get the sum, then merge it back onto the de-duplicated df on the unique key columns:

import pandas as pd
import numpy as np

z = pd.DataFrame({'lat': [49.125, 49.125],
                  'lon': [-114.125, -114.125],
                  'time': [np.datetime64('2005-08-09'), np.datetime64('2005-08-09')],
                  'duration': [3, 6],
                  'size': [4, 10]}) # original data

gp = z.groupby(['lat', 'lon','time'], as_index=False)[['size']].sum() # getting the sum of 'size' for unique combination of lat, lon, time

df = z.sort_values(by='duration', ascending=True).drop_duplicates(subset=['lat', 'lon','time'], keep='last') # dropping duplicates

pd.merge(df[['lat', 'lon', 'time', 'duration']], gp, on=['lat', 'lon', 'time']) # adding the columns summed onto the df without duplicates

      lat      lon       time  duration  size
0  49.125 -114.125 2005-08-09         6    14

Another way, based on sophocles' answer:

res = z.sort_values(by='duration', ascending=False).groupby(['time', 'lat', 'lon']).agg({
    'duration': 'first',  # same as 'max' since we've sorted the data by duration DESC
    'size': 'sum'})

This one could become less readable if you have several columns you want to keep (you'd end up with a lot of 'first' entries in the agg function).
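One way to keep that readable with many columns is to build the agg dict programmatically; a sketch (using 'max' for duration so no pre-sort is needed):

```python
import pandas as pd
import numpy as np

z = pd.DataFrame({'lat': [49.125, 49.125],
                  'lon': [-114.125, -114.125],
                  'time': [np.datetime64('2005-08-09'), np.datetime64('2005-08-09')],
                  'duration': [3, 6],
                  'size': [4, 10]})

keys = ['time', 'lat', 'lon']
# Default every non-key column to 'first', then override the special cases
agg_spec = {c: 'first' for c in z.columns if c not in keys}
agg_spec.update({'duration': 'max', 'size': 'sum'})

res = z.groupby(keys, as_index=False).agg(agg_spec)
```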

politinsa