TLDR:

How do I add columns containing ordered timeseries measurement data from rows of an existing column?

The Data

I have a dataframe with data from 8 machines. Each row contains a unique machine ID, a timestamp, a measurement taken from an internal sensor, a count value indicating the order in which the measurements were taken (this can also be inferred from the timestamps), and a binary failure value, where 0 indicates the machine is operational and 1 indicates the machine has failed.

import numpy as np
import pandas as pd

data = {
    'ID': [1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7, 8, 5, 6],
    'measurement': [1.365, 5.454, 6.825, 5.742, 6.529, 1.937, 0.246, 0.259, 4.755, 5.678, 0.635, 6.451, 2.082, 8.76, 5.423, 0.645, 4.078, 1.918],
    'timestamp': [np.datetime64('2022-06-28 16:58:11.990000'), np.datetime64('2022-06-28 16:58:11.990000'), np.datetime64('2022-06-28 16:58:11.990000'), np.datetime64('2022-06-28 16:58:11.990000'), np.datetime64('2022-06-28 16:58:16.011000'), np.datetime64('2022-06-28 16:58:16.011000'), np.datetime64('2022-06-28 16:58:16.011000'), np.datetime64('2022-06-28 16:58:16.011000'), np.datetime64('2022-06-28 16:58:17.971000'), np.datetime64('2022-06-28 16:58:17.971000'), np.datetime64('2022-06-28 16:58:17.971000'), np.datetime64('2022-06-28 16:58:17.971000'), np.datetime64('2022-06-28 16:58:20.067000'), np.datetime64('2022-06-28 16:58:20.067000'), np.datetime64('2022-06-28 16:58:20.067000'), np.datetime64('2022-06-28 16:58:20.067000'), np.datetime64('2022-06-28 16:58:21.801000'), np.datetime64('2022-06-28 16:58:21.801000')],
    'count': [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    'failure': [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
}
  
# Create DataFrame
df = pd.DataFrame({k:pd.Series(v) for k,v in data.items()})
  
# Print the output.
print(df)

Output:

    ID  measurement               timestamp  count  failure
0    1        1.365 2022-06-28 16:58:11.990      1        0
1    2        5.454 2022-06-28 16:58:11.990      1        1
2    3        6.825 2022-06-28 16:58:11.990      1        0
3    4        5.742 2022-06-28 16:58:11.990      1        0
4    1        6.529 2022-06-28 16:58:16.011      2        0
5    2        1.937 2022-06-28 16:58:16.011      2        1
6    3        0.246 2022-06-28 16:58:16.011      2        0
7    4        0.259 2022-06-28 16:58:16.011      2        0
8    5        4.755 2022-06-28 16:58:17.971      1        0
9    6        5.678 2022-06-28 16:58:17.971      1        0
10   7        0.635 2022-06-28 16:58:17.971      1        0
11   8        6.451 2022-06-28 16:58:17.971      1        0
12   5        2.082 2022-06-28 16:58:20.067      2        0
13   6        8.760 2022-06-28 16:58:20.067      2        1
14   7        5.423 2022-06-28 16:58:20.067      2        1
15   8        0.645 2022-06-28 16:58:20.067      2        0
16   5        4.078 2022-06-28 16:58:21.801      3        0
17   6        1.918 2022-06-28 16:58:21.801      3        0

The Problem

I want to add two columns to the dataframe containing, for each row, the last two measurement values taken on the same machine before that timepoint. If there are fewer than 2 previous measurement values, I want the column(s) to be populated with NaN values. My desired output looks like this:

    ID  measurement  past_measure_1  past_measure_2               timestamp  count  failure
0    1        1.365             NaN             NaN 2022-06-28 16:58:11.990      1        0
1    2        5.454             NaN             NaN 2022-06-28 16:58:11.990      1        1
2    3        6.825             NaN             NaN 2022-06-28 16:58:11.990      1        0
3    4        5.742             NaN             NaN 2022-06-28 16:58:11.990      1        0
4    1        6.529           1.365             NaN 2022-06-28 16:58:16.011      2        0
5    2        1.937           5.454             NaN 2022-06-28 16:58:16.011      2        1
6    3        0.246           6.825             NaN 2022-06-28 16:58:16.011      2        0
7    4        0.259           5.742             NaN 2022-06-28 16:58:16.011      2        0
8    5        4.755             NaN             NaN 2022-06-28 16:58:17.971      1        0
9    6        5.678             NaN             NaN 2022-06-28 16:58:17.971      1        0
10   7        0.635             NaN             NaN 2022-06-28 16:58:17.971      1        0
11   8        6.451             NaN             NaN 2022-06-28 16:58:17.971      1        0
12   5        2.082           4.755             NaN 2022-06-28 16:58:20.067      2        0
13   6        8.760           5.678             NaN 2022-06-28 16:58:20.067      2        1
14   7        5.423           0.635             NaN 2022-06-28 16:58:20.067      2        1
15   8        0.645           6.451             NaN 2022-06-28 16:58:20.067      2        0
16   5        4.078           2.082           4.755 2022-06-28 16:58:21.801      3        0
17   6        1.918           8.760           5.678 2022-06-28 16:58:21.801      3        0

I've figured out how to group all the measurement values by machine ID number using Pandas pivot (shown below),

df_pivot = df.pivot(
    index='ID', 
    columns='count', 
    values='measurement'
)
df_pivot = df_pivot.rename_axis(None, axis=1).reset_index()
print(df_pivot)

but I don't know how to get from there to my desired output.

Rory McGuire

1 Answer

Here you go. This assumes df is already sorted by timestamp; if it isn't, run df.sort_values("timestamp", inplace=True) first.

# Select the measurement column first so only that column is shifted within each machine's group
df["past_measure_1"] = df.groupby("ID")["measurement"].shift(1)
df["past_measure_2"] = df.groupby("ID")["measurement"].shift(2)

hope that helps
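
To make this easy to try, here is a minimal self-contained sketch on a cut-down copy of the question's data (machines 5 and 6 only). The n_lags variable and the loop are my own additions for illustration, not something from the question; the stable sort is there only so that rows with identical timestamps keep their original order.

```python
import pandas as pd

# Cut-down copy of the question's data: machines 5 and 6 only.
df = pd.DataFrame({
    "ID": [5, 6, 5, 6, 5, 6],
    "measurement": [4.755, 5.678, 2.082, 8.760, 4.078, 1.918],
    "timestamp": pd.to_datetime([
        "2022-06-28 16:58:17.971", "2022-06-28 16:58:17.971",
        "2022-06-28 16:58:20.067", "2022-06-28 16:58:20.067",
        "2022-06-28 16:58:21.801", "2022-06-28 16:58:21.801",
    ]),
})

# Make sure rows are in chronological order before shifting.
# A stable sort keeps the existing order of rows that share a timestamp.
df = df.sort_values("timestamp", kind="stable")

# n_lags is a hypothetical parameter: how many previous readings to keep.
n_lags = 2
for lag in range(1, n_lags + 1):
    # shift(lag) moves values down by `lag` rows within each machine's group,
    # leaving NaN where there is no earlier reading for that machine.
    df[f"past_measure_{lag}"] = df.groupby("ID")["measurement"].shift(lag)

print(df[["ID", "measurement", "past_measure_1", "past_measure_2"]])
```

The first reading of each machine gets NaN in both new columns, the second gets NaN only in past_measure_2, and from the third reading on both columns are filled.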

Another suggestion: don't use timestamps without a timezone. Always attach one.

Another suggestion: convert your timestamps to datetime.

df["timestamp"] = pd.to_datetime(df["timestamp"])

Add utc=True if the timestamps are in UTC; if they are in another timezone, set that timezone instead.
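
A rough sketch of both cases, in case it helps. The sample values are two timestamps from the question, and "Europe/Berlin" is purely a placeholder I picked for illustration; substitute whatever zone the machines actually log in.

```python
import pandas as pd

# Two timezone-naive timestamps taken from the question's data.
naive = pd.Series(pd.to_datetime([
    "2022-06-28 16:58:11.990",
    "2022-06-28 16:58:16.011",
]))

# Case 1: the readings were recorded in UTC.
# utc=True localizes timezone-naive values as UTC.
as_utc = pd.to_datetime(naive, utc=True)

# Case 2: the readings are local wall-clock times in some other zone.
# "Europe/Berlin" is an assumed example zone, not something from the question.
as_local = naive.dt.tz_localize("Europe/Berlin")

# Optionally convert to UTC afterwards so everything is stored consistently.
as_local_utc = as_local.dt.tz_convert("UTC")

print(as_utc)
print(as_local_utc)
```

Note that tz_localize only attaches a zone to naive values, while tz_convert changes the zone of values that already have one.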

Nikolay Zakirov