For a time series analysis, I have to drop instances that occur on the same date. However, keep some of the 'deleted' information and add it to the remaining 'duplicate' instance. Below a short example of part of my dataset.
z = pd.DataFrame({'lat':[49.125,49.125], 'lon':[-114.125 ,-114.125 ], 'time':[np.datetime64('2005-08-09'),np.datetime64('2005-08-09')], 'duration':[3,6],'size':[4,10]})
lat lon time duration size
0 49.125 -114.125 2005-08-09 3 4
1 49.125 -114.125 2005-08-09 6 10
I would like to drop the (duplicate) instance which has the lowest 'duration' value but at the same time sum the 'size' variables. Output would look like:
lat lon time duration size
0 49.125 -114.125 2005-08-09 6 14
Does anyone know how I would be able to tackle such a problem? Furthermore, for another variable, I would like to take the mean of these values. Yet I do think the process would be similar to summing the values.
edit: so far I know how to get the highest duration value to remain using:
z.sort_values(by='duration', ascending=False).drop_duplicates(subset=['lat', 'lon','time'], keep='last')