Downsample dataframe correctly

Question

I am having a input dataset, see an input sample bellow, and I want to downsample it. To do so I am using

resample_time=25
init_len = len(df.index)
df = df.set_index('time', drop=False).resample('{}S'.format(resample_time)).last().dropna()
df.index = range(0, len(df.index))

A sample of the output can be found below. However, the output I am getting is not the expected, see expected output bellow. That is to keep a row every 25 seconds. Could you please someone explain why is that happening and how we can fix it?

Input:

          lon        lat                time
0    116.317117  40.075417 2007-05-06 04:21:12
1    116.317067  40.075217 2007-05-06 04:21:33
2    116.317233  40.075250 2007-05-06 04:21:53
3    116.317217  40.075417 2007-05-06 04:22:04
4    116.317133  40.075567 2007-05-06 04:22:23
5    116.317167  40.075400 2007-05-06 04:46:48
6    116.317233  40.075183 2007-05-06 04:46:54
7    116.317050  40.074933 2007-05-06 04:47:00
8    116.313567  40.073983 2007-05-06 04:47:36
9    116.311133  40.073167 2007-05-06 04:48:44
10   116.308017  40.072300 2007-05-06 04:49:15
11   116.307467  40.072483 2007-05-06 04:49:22
12   116.306250  40.074017 2007-05-06 04:49:45
13   116.306450  40.074283 2007-05-06 04:49:52

Output:

            lon        lat                time
0    116.317117  40.075417 2007-05-06 04:21:12
1    116.317067  40.075217 2007-05-06 04:21:33
2    116.317217  40.075417 2007-05-06 04:22:04
3    116.317133  40.075567 2007-05-06 04:22:23
4    116.317050  40.074933 2007-05-06 04:47:00
5    116.313567  40.073983 2007-05-06 04:47:36
6    116.311133  40.073167 2007-05-06 04:48:44
7    116.307467  40.072483 2007-05-06 04:49:22
8    116.306450  40.074283 2007-05-06 04:49:52
9    116.308567  40.071850 2007-05-06 04:50:30
10   116.308667  40.071650 2007-05-06 04:50:57
11   116.310450  40.068850 2007-05-06 04:51:38
12   116.311800  40.067717 2007-05-06 04:52:02
13   116.312300  40.067067 2007-05-06 04:52:21
14   116.312667  40.066617 2007-05-06 04:52:32
15   116.312800  40.066450 2007-05-06 04:53:05
16   116.314067  40.064867 2007-05-06 04:53:38
17   116.314783  40.063667 2007-05-06 04:54:14
18   116.315867  40.062167 2007-05-06 04:54:41
19   116.318550  40.058583 2007-05-06 04:55:20

Expected output:

        lon        lat                time
0    116.317117  40.075417 2007-05-06 04:21:12 -> Include
1    116.317067  40.075217 2007-05-06 04:21:33 -> Exclude
2    116.317233  40.075250 2007-05-06 04:21:53 -> Include
3    116.317217  40.075417 2007-05-06 04:22:04 -> Exclude
4    116.317133  40.075567 2007-05-06 04:22:23 -> Include
5    116.317167  40.075400 2007-05-06 04:46:48 -> Include 
6    116.317233  40.075183 2007-05-06 04:46:54 -> Exclude
7    116.317050  40.074933 2007-05-06 04:47:00 -> Exclude
8    116.313567  40.073983 2007-05-06 04:47:36 -> Include
9    116.311133  40.073167 2007-05-06 04:48:44 -> Exclude
10   116.308017  40.072300 2007-05-06 04:49:15 -> Include
11   116.307467  40.072483 2007-05-06 04:49:22
12   116.306250  40.074017 2007-05-06 04:49:45
13   116.306450  40.074283 2007-05-06 04:49:52

PS: You can read the .csv file in the link using pd.read_csv(' 20070506033305.csv'), parse_dates=['time'])

@jezrael thanks for the interest. I did an edit with an example of the expected output. — Darkmoor, Sep 13 '22 at 09:11

score 1 · Accepted Answer · answered Sep 13 '22 at 12:03

Based on your expected output, you seem to want to do this:

Starting at the top set a timedelta threshold of 25 seconds and find the first subsequent Timestamp that crosses the threshold.
Reset the threshold based on the newly found value and continue through to the end.

Perhaps there is a better way to achieve this result, but the following should work.

Data

import pandas as pd

data = {'lon': {0: 116.317117, 1: 116.317067, 2: 116.317233, 3: 116.317217, 
                4: 116.317133, 5: 116.317167, 6: 116.317233, 7: 116.31705, 
                8: 116.313567, 9: 116.311133, 10: 116.308017, 11: 116.307467, 
                12: 116.30625, 13: 116.30645}, 
        'lat': {0: 40.075417, 1: 40.075217, 2: 40.07525, 3: 40.075417, 
                4: 40.075567, 5: 40.0754, 6: 40.075183, 7: 40.074933, 
                8: 40.073983, 9: 40.073167, 10: 40.0723, 11: 40.072483, 
                12: 40.074017, 13: 40.074283}, 
        'time': {0: '2007-05-06 04:21:12', 1: '2007-05-06 04:21:33', 
                 2: '2007-05-06 04:21:53', 3: '2007-05-06 04:22:04', 
                 4: '2007-05-06 04:22:23', 5: '2007-05-06 04:46:48', 
                 6: '2007-05-06 04:46:54', 7: '2007-05-06 04:47:00', 
                 8: '2007-05-06 04:47:36', 9: '2007-05-06 04:48:44', 
                 10: '2007-05-06 04:49:15', 11: '2007-05-06 04:49:22', 
                 12: '2007-05-06 04:49:45', 13: '2007-05-06 04:49:52'}
        }

df = pd.DataFrame(data)

df['time'] = pd.to_datetime(df.time)

Code

from datetime import timedelta

index_lst = [0]
for i in df.index[:-1]:
    if i in index_lst:
        check = df.time[i+1:].gt(df.time[i]+timedelta(seconds=25))
        if all(~check):
            break
        index_lst.append(check.idxmax())

res = df.iloc[index_lst].reset_index(drop=True)

print(res)

          lon        lat                time
0  116.317117  40.075417 2007-05-06 04:21:12
1  116.317233  40.075250 2007-05-06 04:21:53
2  116.317133  40.075567 2007-05-06 04:22:23
3  116.317167  40.075400 2007-05-06 04:46:48
4  116.313567  40.073983 2007-05-06 04:47:36
5  116.311133  40.073167 2007-05-06 04:48:44
6  116.308017  40.072300 2007-05-06 04:49:15
7  116.306250  40.074017 2007-05-06 04:49:45

Explanation

Let's say i == 0, we get:

df.time[1:].gt(df.time[0]+timedelta(seconds=25)).head()

1    False
2     True # namely: '2007-05-06 04:21:53'
3     True
4     True
5     True

With idxmax() we get 2 for the first True and we append this value to our list.
On the next iteration we will now skip 1 (not in the list) and restart at 2 (now in the list).

Adding if all(~check): break allows us to step out of the For Loop when we know we won't find any True values anymore.

N.B. I'm assuming here that the following call results from a misread:

8    116.313567  40.073983 2007-05-06 04:47:36 -> Include
9    116.311133  40.073167 2007-05-06 04:48:44 -> Exclude

But 9 should be included as well (diff being 1 minute and 8 seconds).

Thanks for the answer. Yes there is a misread in the these two samples you mention. Please give a few minutes to understand your answer. — Darkmoor, Sep 13 '22 at 12:26

Downsample dataframe correctly

1 Answers1