I am trying to reshape data using pandas and have been having a hard time getting it into the right format. Roughly, the data look like this*:
df = pd.DataFrame({'PRODUCT':['1','2'],
'DESIGN_START':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17')],
'DESIGN_COMPLETE':[pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04')],
'PRODUCTION_START':[pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15')],
'PRODUCTION_COMPLETE':[np.nan,pd.Timestamp('2020-04-28')]})
print(df)
PRODUCT DESIGN_START DESIGN_COMPLETE PRODUCTION_START PRODUCTION_COMPLETE
0 1 2020-01-05 2020-01-22 2020-02-07 NaT
1 2 2020-01-17 2020-03-04 2020-03-15 2020-04-28
I would like to reshape the data so that it looks like this:
reshaped_df = pd.DataFrame({'DATE':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17'),
pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04'),
pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15'),
np.nan,pd.Timestamp('2020-04-28')],
'STAGE':['design','design','design','design','production','production','production','production'],
'STATUS':['started','started','completed','completed','started','started','completed','completed']})
print(reshaped_df)
DATE STAGE STATUS
0 2020-01-05 design started
1 2020-01-17 design started
2 2020-01-22 design completed
3 2020-03-04 design completed
4 2020-02-07 production started
5 2020-03-15 production started
6 NaT production completed
7 2020-04-28 production completed
How can I go about doing this? Is there a better format to reshape it to?
Ultimately I'd like to do some group summaries on the data, such as the number of times each step occurred, e.g.
reshaped_df.groupby(['STAGE','STATUS'])['DATE'].count()
STAGE STATUS
design completed 2
started 2
production completed 1
started 2
Name: DATE, dtype: int64
Thank you
- The data actually contain many date start/stop columns for different stages of the manufacturing pipeline