I have an excel worksheet that has some data in the following format:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 Hour 1 FI NO2 DK1 DK2 SE1 SE2 NO4 NO1 NO3 NO5 SE3 SE4
1 D2 ND: 17-12-2022 7258 2101.2 751.334 567.917 418.5 35.1 1370.9 1254.971 1854.434 1584.931 1396.1 1633.6
2 D1 ND: 16-12-2022 3702.878 1984.168 -1435.167 -130.916 802 316.1 1343.495 1367.602 1838.14 1251.873 981 1474.2
3 D1 ND: 15-12-2022 3702.878 1984.168 -1435.167 -130.916 802 316.1 1343.495 1367.602 1838.14 1251.873 981 1474.2
4 D1 ND: 14-12-2022 3702.878 1984.168 -1435.167 -130.916 802 316.1 1343.495 1367.602 1838.14 1251.873 981 1474.2
5 D1 ND: 13-12-2022 3702.878 1984.168 -1435.167 -130.916 802 316.1 1343.495 1367.602 1838.14 1251.873 981 1474.2
6 D1 ND: 10-12-2022 3702.878 1984.168 -1435.167 -130.916 802 316.1 1343.495 1367.602 1838.14 1251.873 981 1474.2
7 Selected: 16-12-2022:7 4885.746 1960.018 -799.833 -76.084 628 -38.1 1356.89 1537.029 1730.735 1412.038 1960.2 1878.3
This then repeats over 24 hours in the same block format of 8 rows and 13 columns, like so:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
8 Hour 2 FI NO2 DK1 DK2 SE1 SE2 NO4 NO1 NO3 NO5 SE3 SE4
9 D2 ND: 17-12-2022 7178.9 2167.879 785.333 524.5 452.7 -29.3 1346.132 1151.952 1818.116 1580.202 1144.8 1578.2
10 D1 ND: 16-12-2022 3641.45 1921.937 -1335.417 -90.75 781.6 190.2 1298.576 1265.627 1763.42 1236.619 811.4 1506.3
11 D1 ND: 15-12-2022 3641.45 1921.937 -1335.417 -90.75 781.6 190.2 1298.576 1265.627 1763.42 1236.619 811.4 1506.3
12 D1 ND: 14-12-2022 3641.45 1921.937 -1335.417 -90.75 781.6 190.2 1298.576 1265.627 1763.42 1236.619 811.4 1506.3
13 D1 ND: 13-12-2022 3641.45 1921.937 -1335.417 -90.75 781.6 190.2 1298.576 1265.627 1763.42 1236.619 811.4 1506.3
14 D1 ND: 10-12-2022 3641.45 1921.937 -1335.417 -90.75 781.6 190.2 1298.576 1265.627 1763.42 1236.619 811.4 1506.3
15 Selected: 16-12-2022:7 4885.746 1960.018 -799.833 -76.084 628 -38.1 1356.89 1537.029 1730.735 1412.038 1960.2 1878.3
I want to convert this to long format with the following columns:
pd.Dataframe(columns = ['Datetime','BZ','type','Horizon (D1 or D2)','Value','Day'])
- Datetime is a date from the name of the doc and the hour is Hour 1, Hour 2 from the first column, first row, every 7th entry.
- BZ is the first row: FI, NO2 etc.
- type is either ND or 'Selected'
- Horizon is D1 or D2
- Value is all the numbers
- Day is the dates from the 2nd column
What have I done so far?
I'm really not sure the best way of attacking this tbh. I thought about splitting the dataframe up into column relevant dataframes or arrays and using them as the id_vars
in pd.melt()
:
[j for j in df.iloc[:,0] if str(j).startswith('Selected')]
[j for j in df.iloc[:,0] if str(j).startswith('D1')]
[j for j in df.iloc[:,0] if str(j).startswith('D2')]
but this won't work.
I guess the issue fundamentally is, how do you melt a dataframe when you need to extract several variables from single cell, multiple row and cyclically!?
Reproducible example added, side-note, there are 24 of these chunks corresponding to hours of a day:
AL = [['Hour X','DK1','DK2','FI','NO1','NO2','NO3','NO4','NO5','SE1','SE2','SE3','SE4'],
['D2 ND: 17-12-2022',7258,2101.2,751.334, 567.917, 418.5, 35.1, 1370.9, 1254.971, 1854.434, 1584.931, 1396.1, 1633.6],
['D1 ND: 16-12-2022',3702.878,1984.168,-1435.167,-130.916,802,316.1,1343.495,1367.602,1838.14,1251.873,981,1474.2],
['D1 ND: 15-12-2022',3702.878,1984.168,-1435.167,-130.916,802,316.1,1343.495,1367.602,1838.14,1251.873,981,1474.2],
['D1 ND: 14-12-2022',3702.878,1984.168,-1435.167,-130.916,802,316.1,1343.495,1367.602,1838.14,1251.873,981,1474.2],
['D1 ND: 13-12-2022',3702.878,1984.168,-1435.167,-130.916,802,316.1,1343.495,1367.602,1838.14,1251.873,981,1474.2],
['D1 ND: 10-12-2022',3702.878,1984.168,-1435.167,-130.916,802,316.1,1343.495,1367.602,1838.14,1251.873,981,1474.2],
['Selected: 16-12-2022:7',4885.746,1960.018,-799.833,-76.084,628,-38.1,1356.89,1537.029,1730.735,1412.038,1960.2,1878.3]]
example = pd.DataFrame(AL,columns = ['Unnamed: 0','Unnamed: 1','Unnamed: 2','Unnamed: 3','Unnamed: 4','Unnamed: 5','Unnamed: 6','Unnamed: 7','Unnamed: 8','Unnamed: 9','Unnamed: 10','Unnamed: 11','Unnamed: 12']
)
Expected outcome (value column contains arbitrary figures):
AL = pd.DataFrame([['17-12-2022 00:00:00', 'FI','ND', 'D2','17-12-2022 00:00:00',7256],
['17-12-2022 00:00:00', 'FI','ND', 'D1','16-12-2022 00:00:00',7256],
['17-12-2022 00:00:00', 'FI','ND', 'D1','15-12-2022 00:00:00',7256],
['17-12-2022 00:00:00', 'FI','ND', 'D1','14-12-2022 00:00:00',7256],
['17-12-2022 00:00:00', 'FI','ND', 'D1','13-12-2022 00:00:00',7256],
['17-12-2022 00:00:00', 'FI','ND', 'D1','10-12-2022 00:00:00',7256],
['17-12-2022 00:00:00', 'FI','Selected', 'NaN','16-12-2022 07:00:00',7256],
['17-12-2022 00:00:00', 'DK','ND', 'D2','17-12-2022 00:00:00',4556],
['17-12-2022 00:00:00', 'DK','ND', 'D1','16-12-2022 00:00:00',4556],
['17-12-2022 00:00:00', 'DK','ND', 'D1','15-12-2022 00:00:00',4556],
['17-12-2022 00:00:00', 'DK','ND', 'D1','14-12-2022 00:00:00',4556],
['17-12-2022 00:00:00', 'DK','ND', 'D1','13-12-2022 00:00:00',4556],
['17-12-2022 00:00:00', 'DK','ND', 'D1','10-12-2022 00:00:00',4556],
['17-12-2022 00:00:00', 'DK','Selected', 'Nan','16-12-2022 07:00:00',4556]],
columns =['DateTimeUTC','BZ','Type (ND/Selected)','Horizon','Day','Value'])
I've only added enough rows to include the first two BZ's from the first hour otherwise this would extend through all BZ and all 24 hours. The NaN value in the Selected column is also arbitrary.