0

I am having a input dataset, see an input sample bellow, and I want to downsample it. To do so I am using

resample_time=25
init_len = len(df.index)
df = df.set_index('time', drop=False).resample('{}S'.format(resample_time)).last().dropna()
df.index = range(0, len(df.index))

A sample of the output can be found below. However, the output I am getting is not the expected, see expected output bellow. That is to keep a row every 25 seconds. Could you please someone explain why is that happening and how we can fix it?

Input:

          lon        lat                time
0    116.317117  40.075417 2007-05-06 04:21:12
1    116.317067  40.075217 2007-05-06 04:21:33
2    116.317233  40.075250 2007-05-06 04:21:53
3    116.317217  40.075417 2007-05-06 04:22:04
4    116.317133  40.075567 2007-05-06 04:22:23
5    116.317167  40.075400 2007-05-06 04:46:48
6    116.317233  40.075183 2007-05-06 04:46:54
7    116.317050  40.074933 2007-05-06 04:47:00
8    116.313567  40.073983 2007-05-06 04:47:36
9    116.311133  40.073167 2007-05-06 04:48:44
10   116.308017  40.072300 2007-05-06 04:49:15
11   116.307467  40.072483 2007-05-06 04:49:22
12   116.306250  40.074017 2007-05-06 04:49:45
13   116.306450  40.074283 2007-05-06 04:49:52

Output:

            lon        lat                time
0    116.317117  40.075417 2007-05-06 04:21:12
1    116.317067  40.075217 2007-05-06 04:21:33
2    116.317217  40.075417 2007-05-06 04:22:04
3    116.317133  40.075567 2007-05-06 04:22:23
4    116.317050  40.074933 2007-05-06 04:47:00
5    116.313567  40.073983 2007-05-06 04:47:36
6    116.311133  40.073167 2007-05-06 04:48:44
7    116.307467  40.072483 2007-05-06 04:49:22
8    116.306450  40.074283 2007-05-06 04:49:52
9    116.308567  40.071850 2007-05-06 04:50:30
10   116.308667  40.071650 2007-05-06 04:50:57
11   116.310450  40.068850 2007-05-06 04:51:38
12   116.311800  40.067717 2007-05-06 04:52:02
13   116.312300  40.067067 2007-05-06 04:52:21
14   116.312667  40.066617 2007-05-06 04:52:32
15   116.312800  40.066450 2007-05-06 04:53:05
16   116.314067  40.064867 2007-05-06 04:53:38
17   116.314783  40.063667 2007-05-06 04:54:14
18   116.315867  40.062167 2007-05-06 04:54:41
19   116.318550  40.058583 2007-05-06 04:55:20

Expected output:

        lon        lat                time
0    116.317117  40.075417 2007-05-06 04:21:12 -> Include
1    116.317067  40.075217 2007-05-06 04:21:33 -> Exclude
2    116.317233  40.075250 2007-05-06 04:21:53 -> Include
3    116.317217  40.075417 2007-05-06 04:22:04 -> Exclude
4    116.317133  40.075567 2007-05-06 04:22:23 -> Include
5    116.317167  40.075400 2007-05-06 04:46:48 -> Include 
6    116.317233  40.075183 2007-05-06 04:46:54 -> Exclude
7    116.317050  40.074933 2007-05-06 04:47:00 -> Exclude
8    116.313567  40.073983 2007-05-06 04:47:36 -> Include
9    116.311133  40.073167 2007-05-06 04:48:44 -> Exclude
10   116.308017  40.072300 2007-05-06 04:49:15 -> Include
11   116.307467  40.072483 2007-05-06 04:49:22
12   116.306250  40.074017 2007-05-06 04:49:45
13   116.306450  40.074283 2007-05-06 04:49:52

PS: You can read the .csv file in the link using pd.read_csv(' 20070506033305.csv'), parse_dates=['time'])

Darkmoor
  • 862
  • 11
  • 29

1 Answers1

1

Based on your expected output, you seem to want to do this:

  • Starting at the top set a timedelta threshold of 25 seconds and find the first subsequent Timestamp that crosses the threshold.
  • Reset the threshold based on the newly found value and continue through to the end.

Perhaps there is a better way to achieve this result, but the following should work.

Data

import pandas as pd

data = {'lon': {0: 116.317117, 1: 116.317067, 2: 116.317233, 3: 116.317217, 
                4: 116.317133, 5: 116.317167, 6: 116.317233, 7: 116.31705, 
                8: 116.313567, 9: 116.311133, 10: 116.308017, 11: 116.307467, 
                12: 116.30625, 13: 116.30645}, 
        'lat': {0: 40.075417, 1: 40.075217, 2: 40.07525, 3: 40.075417, 
                4: 40.075567, 5: 40.0754, 6: 40.075183, 7: 40.074933, 
                8: 40.073983, 9: 40.073167, 10: 40.0723, 11: 40.072483, 
                12: 40.074017, 13: 40.074283}, 
        'time': {0: '2007-05-06 04:21:12', 1: '2007-05-06 04:21:33', 
                 2: '2007-05-06 04:21:53', 3: '2007-05-06 04:22:04', 
                 4: '2007-05-06 04:22:23', 5: '2007-05-06 04:46:48', 
                 6: '2007-05-06 04:46:54', 7: '2007-05-06 04:47:00', 
                 8: '2007-05-06 04:47:36', 9: '2007-05-06 04:48:44', 
                 10: '2007-05-06 04:49:15', 11: '2007-05-06 04:49:22', 
                 12: '2007-05-06 04:49:45', 13: '2007-05-06 04:49:52'}
        }

df = pd.DataFrame(data)

df['time'] = pd.to_datetime(df.time)

Code

from datetime import timedelta

index_lst = [0]
for i in df.index[:-1]:
    if i in index_lst:
        check = df.time[i+1:].gt(df.time[i]+timedelta(seconds=25))
        if all(~check):
            break
        index_lst.append(check.idxmax())

res = df.iloc[index_lst].reset_index(drop=True)

print(res)

          lon        lat                time
0  116.317117  40.075417 2007-05-06 04:21:12
1  116.317233  40.075250 2007-05-06 04:21:53
2  116.317133  40.075567 2007-05-06 04:22:23
3  116.317167  40.075400 2007-05-06 04:46:48
4  116.313567  40.073983 2007-05-06 04:47:36
5  116.311133  40.073167 2007-05-06 04:48:44
6  116.308017  40.072300 2007-05-06 04:49:15
7  116.306250  40.074017 2007-05-06 04:49:45

Explanation

Let's say i == 0, we get:

df.time[1:].gt(df.time[0]+timedelta(seconds=25)).head()

1    False
2     True # namely: '2007-05-06 04:21:53'
3     True
4     True
5     True
  • With idxmax() we get 2 for the first True and we append this value to our list.
  • On the next iteration we will now skip 1 (not in the list) and restart at 2 (now in the list).

Adding if all(~check): break allows us to step out of the For Loop when we know we won't find any True values anymore.


N.B. I'm assuming here that the following call results from a misread:

8    116.313567  40.073983 2007-05-06 04:47:36 -> Include
9    116.311133  40.073167 2007-05-06 04:48:44 -> Exclude

But 9 should be included as well (diff being 1 minute and 8 seconds).

ouroboros1
  • 9,113
  • 3
  • 7
  • 26
  • Thanks for the answer. Yes there is a misread in the these two samples you mention. Please give a few minutes to understand your answer. – Darkmoor Sep 13 '22 at 12:26