TLDR:
How do I add columns containing ordered timeseries measurement data from rows of an existing column?
The Data
I have a dataframe with data from 8 machines. The data includes a unique machine ID, a timestamp, a measurement taken from an internal sensor, a count value indicating the order in which the measurements were taken (this can also be inferred from the timestamp values), and finally a binary failure value, where 0 indicates the machine is operational and 1 indicates the machine has failed.
import numpy as np
import pandas as pd
data = {
'ID': [1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7, 8, 5, 6],
'measurement': [1.365, 5.454, 6.825, 5.742, 6.529, 1.937, 0.246, 0.259, 4.755, 5.678, 0.635, 6.451, 2.082, 8.76, 5.423, 0.645, 4.078, 1.918],
'timestamp': [np.datetime64('2022-06-28 16:58:11.990000'), np.datetime64('2022-06-28 16:58:11.990000'), np.datetime64('2022-06-28 16:58:11.990000'), np.datetime64('2022-06-28 16:58:11.990000'), np.datetime64('2022-06-28 16:58:16.011000'), np.datetime64('2022-06-28 16:58:16.011000'), np.datetime64('2022-06-28 16:58:16.011000'), np.datetime64('2022-06-28 16:58:16.011000'), np.datetime64('2022-06-28 16:58:17.971000'), np.datetime64('2022-06-28 16:58:17.971000'), np.datetime64('2022-06-28 16:58:17.971000'), np.datetime64('2022-06-28 16:58:17.971000'), np.datetime64('2022-06-28 16:58:20.067000'), np.datetime64('2022-06-28 16:58:20.067000'), np.datetime64('2022-06-28 16:58:20.067000'), np.datetime64('2022-06-28 16:58:20.067000'), np.datetime64('2022-06-28 16:58:21.801000'), np.datetime64('2022-06-28 16:58:21.801000')],
'count': [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
'failure': [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
}
# Create the DataFrame directly from the dict of equal-length lists
df = pd.DataFrame(data)
# Print the output.
print(df)
Output:
ID measurement timestamp count failure
0 1 1.365 2022-06-28 16:58:11.990 1 0
1 2 5.454 2022-06-28 16:58:11.990 1 1
2 3 6.825 2022-06-28 16:58:11.990 1 0
3 4 5.742 2022-06-28 16:58:11.990 1 0
4 1 6.529 2022-06-28 16:58:16.011 2 0
5 2 1.937 2022-06-28 16:58:16.011 2 1
6 3 0.246 2022-06-28 16:58:16.011 2 0
7 4 0.259 2022-06-28 16:58:16.011 2 0
8 5 4.755 2022-06-28 16:58:17.971 1 0
9 6 5.678 2022-06-28 16:58:17.971 1 0
10 7 0.635 2022-06-28 16:58:17.971 1 0
11 8 6.451 2022-06-28 16:58:17.971 1 0
12 5 2.082 2022-06-28 16:58:20.067 2 0
13 6 8.760 2022-06-28 16:58:20.067 2 1
14 7 5.423 2022-06-28 16:58:20.067 2 1
15 8 0.645 2022-06-28 16:58:20.067 2 0
16 5 4.078 2022-06-28 16:58:21.801 3 0
17 6 1.918 2022-06-28 16:58:21.801 3 0
The Problem
I want to add two columns to the dataframe containing, for each row, the last two measurement values taken from the same machine before that row's timestamp. If there are fewer than 2 previous measurement values, I want the column(s) to be populated with NaN values. My desired output looks like this:
ID measurement past_measure_1 past_measure_2 timestamp count failure
0 1 1.365 NaN NaN 2022-06-28 16:58:11.990 1 0
1 2 5.454 NaN NaN 2022-06-28 16:58:11.990 1 1
2 3 6.825 NaN NaN 2022-06-28 16:58:11.990 1 0
3 4 5.742 NaN NaN 2022-06-28 16:58:11.990 1 0
4 1 6.529 1.365 NaN 2022-06-28 16:58:16.011 2 0
5 2 1.937 5.454 NaN 2022-06-28 16:58:16.011 2 1
6 3 0.246 6.825 NaN 2022-06-28 16:58:16.011 2 0
7 4 0.259 5.742 NaN 2022-06-28 16:58:16.011 2 0
8 5 4.755 NaN NaN 2022-06-28 16:58:17.971 1 0
9 6 5.678 NaN NaN 2022-06-28 16:58:17.971 1 0
10 7 0.635 NaN NaN 2022-06-28 16:58:17.971 1 0
11 8 6.451 NaN NaN 2022-06-28 16:58:17.971 1 0
12 5 2.082 4.755 NaN 2022-06-28 16:58:20.067 2 0
13 6 8.760 5.678 NaN 2022-06-28 16:58:20.067 2 1
14 7 5.423 0.635 NaN 2022-06-28 16:58:20.067 2 1
15 8 0.645 6.451 NaN 2022-06-28 16:58:20.067 2 0
16 5 4.078 2.082 4.755 2022-06-28 16:58:21.801 3 0
17 6 1.918 8.760 5.678 2022-06-28 16:58:21.801 3 0
I've figured out how to group all the measurement values by machine ID using pandas pivot (shown below),
df_pivot = df.pivot(
index='ID',
columns='count',
values='measurement'
)
df_pivot = df_pivot.rename_axis(None, axis=1).reset_index()
print(df_pivot)
but I don't know how to get from there to my desired output.
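
For reference, here is a minimal sketch of one direction that skips the pivot entirely: shifting the measurement column within each machine's group. It assumes that count (or timestamp) gives the correct order within each ID, and it uses a small stand-in frame containing only machines 5 and 6 from the data above, so it is an illustration rather than a confirmed solution.

```python
import pandas as pd

# Stand-in frame: machines 5 and 6 only, same column names as the full data.
df = pd.DataFrame({
    'ID': [5, 6, 5, 6, 5, 6],
    'measurement': [4.755, 5.678, 2.082, 8.760, 4.078, 1.918],
    'count': [1, 1, 2, 2, 3, 3],
})

# Sort so that, within each machine, rows follow the measurement order.
df = df.sort_values(['ID', 'count'])

# Shift measurements down by 1 and 2 positions inside each ID group;
# rows without enough history are filled with NaN automatically.
grouped = df.groupby('ID')['measurement']
df['past_measure_1'] = grouped.shift(1)
df['past_measure_2'] = grouped.shift(2)

# Restore the original row order.
df = df.sort_index()
print(df)
```

Because the shift happens per group, values from one machine never leak into another machine's rows, which is the property the desired output relies on.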