Clean the time series data by Python

Question

I have a raw data set with ship code (MMSI), time. Because of large amount of raw data, now I want to remove data by ship code (MMSI) with time step of more than 10 minutes. For example:

enter image description here

To this:
enter image description here

I tried to calculate the time interval between rows then used "for" and "if" but it seems quite complicated. I am new to coding. These what I already done so far:

df['diff'] = df.sort_values(['MMSI','TIME']).groupby('MMSI')['TIME'].diff()
df = df.dropna(subset=['diff'])
for i in ship_list:
    df2 = df.loc[df['MMSI'] == i]
    total = 0
    if (total< 10)
        total = df['diff'].iloc() #stuck here

score 0 · Answer 1 · answered Nov 28 '19 at 10:50

Possible approach to the problem - take a single row from every 10-minute interval (freq='10T' means 10 minutes).

import pandas as pd

# sample data
import numpy as np
df = pd.DataFrame({
    'shipcode': [1] * 5 + [2] * 5 + [1] * 5 + [2] * 5,
    'time': np.array([1.01e18 + 8.3e10 * val for val in range(20)]).astype('datetime64')})

# print(df)

# floor 'time' values with a step of 10 minutes
df['floored_time'] = df['time'].dt.floor(freq='10T')

# groupby to remove duplicates of 'shipcode', 'floored_time' pairs
dfg = df.groupby(['shipcode', 'floored_time']).agg({'time': 'min'})

# construct resulting table
dfg['shipcode'] = dfg.index.get_level_values(0)
ans = dfg[['shipcode', 'time']].sort_values(by='time').reset_index(drop=True)

# print(ans)

Sample data:

    shipcode                time
0          1 2002-01-02 19:33:20
1          1 2002-01-02 19:34:43
2          1 2002-01-02 19:36:06
3          1 2002-01-02 19:37:29
4          1 2002-01-02 19:38:52
5          2 2002-01-02 19:40:15
6          2 2002-01-02 19:41:38
7          2 2002-01-02 19:43:01
8          2 2002-01-02 19:44:24
9          2 2002-01-02 19:45:47
10         1 2002-01-02 19:47:10
11         1 2002-01-02 19:48:33
12         1 2002-01-02 19:49:56
13         1 2002-01-02 19:51:19
14         1 2002-01-02 19:52:42
15         2 2002-01-02 19:54:05
16         2 2002-01-02 19:55:28
17         2 2002-01-02 19:56:51
18         2 2002-01-02 19:58:14
19         2 2002-01-02 19:59:37

Result:

   shipcode                time
0         1 2002-01-02 19:33:20
1         2 2002-01-02 19:40:15
2         1 2002-01-02 19:47:10
3         1 2002-01-02 19:51:19
4         2 2002-01-02 19:54:05

Thank you a lot. The code worked. I really appreciate! – Hải Đăng Bùi Dec 04 '19 at 08:22 — Hải Đăng Bùi, Dec 04 '19 at 08:22

Clean the time series data by Python

1 Answers1