
I have an Excel worksheet with data in the following format:

    Unnamed: 0  Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  Unnamed: 10 Unnamed: 11 Unnamed: 12
0   Hour 1  FI  NO2 DK1 DK2 SE1 SE2 NO4 NO1 NO3 NO5 SE3 SE4
1   D2 ND: 17-12-2022   7258    2101.2  751.334 567.917 418.5   35.1    1370.9  1254.971    1854.434    1584.931    1396.1  1633.6
2   D1 ND: 16-12-2022   3702.878    1984.168    -1435.167   -130.916    802 316.1   1343.495    1367.602    1838.14 1251.873    981 1474.2
3   D1 ND: 15-12-2022   3702.878    1984.168    -1435.167   -130.916    802 316.1   1343.495    1367.602    1838.14 1251.873    981 1474.2
4   D1 ND: 14-12-2022   3702.878    1984.168    -1435.167   -130.916    802 316.1   1343.495    1367.602    1838.14 1251.873    981 1474.2
5   D1 ND: 13-12-2022   3702.878    1984.168    -1435.167   -130.916    802 316.1   1343.495    1367.602    1838.14 1251.873    981 1474.2
6   D1 ND: 10-12-2022   3702.878    1984.168    -1435.167   -130.916    802 316.1   1343.495    1367.602    1838.14 1251.873    981 1474.2
7   Selected: 16-12-2022:7  4885.746    1960.018    -799.833    -76.084 628 -38.1   1356.89 1537.029    1730.735    1412.038    1960.2  1878.3

This block format of 8 rows and 13 columns then repeats for each of the 24 hours, like so:

    Unnamed: 0  Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  Unnamed: 10 Unnamed: 11 Unnamed: 12
8   Hour 2  FI  NO2 DK1 DK2 SE1 SE2 NO4 NO1 NO3 NO5 SE3 SE4
9   D2 ND: 17-12-2022   7178.9  2167.879    785.333 524.5   452.7   -29.3   1346.132    1151.952    1818.116    1580.202    1144.8  1578.2
10  D1 ND: 16-12-2022   3641.45 1921.937    -1335.417   -90.75  781.6   190.2   1298.576    1265.627    1763.42 1236.619    811.4   1506.3
11  D1 ND: 15-12-2022   3641.45 1921.937    -1335.417   -90.75  781.6   190.2   1298.576    1265.627    1763.42 1236.619    811.4   1506.3
12  D1 ND: 14-12-2022   3641.45 1921.937    -1335.417   -90.75  781.6   190.2   1298.576    1265.627    1763.42 1236.619    811.4   1506.3
13  D1 ND: 13-12-2022   3641.45 1921.937    -1335.417   -90.75  781.6   190.2   1298.576    1265.627    1763.42 1236.619    811.4   1506.3
14  D1 ND: 10-12-2022   3641.45 1921.937    -1335.417   -90.75  781.6   190.2   1298.576    1265.627    1763.42 1236.619    811.4   1506.3
15  Selected: 16-12-2022:7  4885.746    1960.018    -799.833    -76.084 628 -38.1   1356.89 1537.029    1730.735    1412.038    1960.2  1878.3

I want to convert this to long format with the following columns:

pd.DataFrame(columns=['Datetime', 'BZ', 'type', 'Horizon (D1 or D2)', 'Value', 'Day'])
  • Datetime combines the date from the name of the doc with the hour ('Hour 1', 'Hour 2', ...) that appears in the first column of the first row of each 8-row block.
  • BZ is the first row: FI, NO2 etc.
  • type is either 'ND' or 'Selected'
  • Horizon is D1 or D2
  • Value is all the numbers
  • Day is the date embedded in the strings in the first column (e.g. '17-12-2022')

What have I done so far?

I'm really not sure of the best way to attack this, to be honest. I thought about splitting the dataframe up into column-relevant dataframes or arrays and using them as the id_vars in pd.melt():

[j for j in df.iloc[:,0] if str(j).startswith('Selected')]
[j for j in df.iloc[:,0] if str(j).startswith('D1')]
[j for j in df.iloc[:,0] if str(j).startswith('D2')]

but this won't work.

I guess the fundamental issue is: how do you melt a dataframe when you need to extract several variables from a single cell, across multiple rows, and cyclically?
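One thing that does seem promising in isolation: `str.extract` with named capture groups can pull several variables out of a single cell in one pass (quick sketch on strings copied from my data):

```python
import pandas as pd

# Strings copied from the first column of my data
s = pd.Series(['D2 ND: 17-12-2022', 'Selected: 16-12-2022:7'])

# One pass: optional horizon, then type, then the date
parts = s.str.extract(r'(?P<Horizon>D\d)?\s*(?P<Type>ND|Selected):\s*(?P<Day>\d{2}-\d{2}-\d{4})')
```

What I still can't see is how to combine that with the cyclic Hour rows.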

Reproducible example added. Side note: there are 24 of these chunks, corresponding to the hours of a day:

AL = [['Hour X','DK1','DK2','FI','NO1','NO2','NO3','NO4','NO5','SE1','SE2','SE3','SE4'],
      ['D2 ND: 17-12-2022',7258,2101.2,751.334, 567.917, 418.5, 35.1, 1370.9, 1254.971, 1854.434, 1584.931, 1396.1, 1633.6],
      ['D1 ND: 16-12-2022',3702.878,1984.168,-1435.167,-130.916,802,316.1,1343.495,1367.602,1838.14,1251.873,981,1474.2],
      ['D1 ND: 15-12-2022',3702.878,1984.168,-1435.167,-130.916,802,316.1,1343.495,1367.602,1838.14,1251.873,981,1474.2],
      ['D1 ND: 14-12-2022',3702.878,1984.168,-1435.167,-130.916,802,316.1,1343.495,1367.602,1838.14,1251.873,981,1474.2],
      ['D1 ND: 13-12-2022',3702.878,1984.168,-1435.167,-130.916,802,316.1,1343.495,1367.602,1838.14,1251.873,981,1474.2],
      ['D1 ND: 10-12-2022',3702.878,1984.168,-1435.167,-130.916,802,316.1,1343.495,1367.602,1838.14,1251.873,981,1474.2],
      ['Selected: 16-12-2022:7',4885.746,1960.018,-799.833,-76.084,628,-38.1,1356.89,1537.029,1730.735,1412.038,1960.2,1878.3]]

example = pd.DataFrame(AL, columns=[f'Unnamed: {i}' for i in range(13)])

Expected outcome (value column contains arbitrary figures):

AL = pd.DataFrame([['17-12-2022 00:00:00', 'FI','ND', 'D2','17-12-2022 00:00:00',7256],
      ['17-12-2022 00:00:00', 'FI','ND', 'D1','16-12-2022 00:00:00',7256],
      ['17-12-2022 00:00:00', 'FI','ND', 'D1','15-12-2022 00:00:00',7256],
      ['17-12-2022 00:00:00', 'FI','ND', 'D1','14-12-2022 00:00:00',7256],
      ['17-12-2022 00:00:00', 'FI','ND', 'D1','13-12-2022 00:00:00',7256],
      ['17-12-2022 00:00:00', 'FI','ND', 'D1','10-12-2022 00:00:00',7256],
      ['17-12-2022 00:00:00', 'FI','Selected', 'NaN','16-12-2022 07:00:00',7256],
      ['17-12-2022 00:00:00', 'DK','ND', 'D2','17-12-2022 00:00:00',4556],
      ['17-12-2022 00:00:00', 'DK','ND', 'D1','16-12-2022 00:00:00',4556],
      ['17-12-2022 00:00:00', 'DK','ND', 'D1','15-12-2022 00:00:00',4556],
      ['17-12-2022 00:00:00', 'DK','ND', 'D1','14-12-2022 00:00:00',4556],
      ['17-12-2022 00:00:00', 'DK','ND', 'D1','13-12-2022 00:00:00',4556],
      ['17-12-2022 00:00:00', 'DK','ND', 'D1','10-12-2022 00:00:00',4556],
      ['17-12-2022 00:00:00', 'DK','Selected', 'NaN','16-12-2022 07:00:00',4556]],
       columns =['DateTimeUTC','BZ','Type (ND/Selected)','Horizon','Day','Value'])

I've only added enough rows to cover the first two BZs from the first hour; otherwise this would extend through all BZs and all 24 hours. The NaN in the Horizon column on the 'Selected' rows is also arbitrary.

Tom
  • It's just an artifact of importing the file so I maintain the structure. Essentially I could use row 1 as the columns. I'll try the answer below and if it's no good I'll make a screenshot. – Tom May 03 '23 at 11:02
  • I'm not completely sure if this answers your question @Corralien but the Unnamed: X are the column headers, which don't have an index. The second block is just a continuation of the first as you follow the index, just with the column headers, unindexed. I'll add a reproducible to my post – Tom May 03 '23 at 13:13
  • @mozway how's that now? – Tom May 03 '23 at 20:20
  • @Tom. Can you check my answer please and explain what is the `Day` column? *Day is the dates from the 2nd column*: What does it mean, the second column is Finland? – Corralien May 03 '23 at 21:24
  • @Corralien thanks. The Day column refers to historical data, the idea is I'm using a few days of old data to find an expected outcome for the current day. So those dates represent the pool of days I pick from. The column headers are confusing because the first entry doesn't represent what's in column 1, more accurately it would say BZ/Var(or something else) and the hour would just be another separate column with repeated entries. The BZ row is countries, order is arbitrary but they begin at the 2nd column, the first column contains a long string to be split for example: 'D2 ND 17-12-2022'. – Tom May 04 '23 at 05:34
  • For your sample input, can you provide the expected output so I can correct my answer if needed. – Corralien May 04 '23 at 06:01
  • @Corralien I have amended my question a little as I realised it would be easier (on the eye) to encode the Hour into the Day column rather than have a whole column with one repeating value. – Tom May 04 '23 at 10:08

2 Answers


Before melting your dataframe, you have to clean it. IIUC, you can use:

# Mask for data rows
m = ~df['Unnamed: 0'].str.startswith('Hour')

# Create datetime index
date = '2023-05-03'  # date from doc
hour = df['Unnamed: 0'].str.extract(r'Hour (\d+)', expand=False).ffill()
dti = pd.to_datetime(date + ' ' + hour, format='%Y-%m-%d %H')[m].rename('Datetime')

# Horizon, Type, Day columns
pattern = r'(?P<Horizon>D1|D2)?\s*(?P<Type>ND|Selected):\s*(?P<Day>[^:]+)'
meta = df.loc[m, 'Unnamed: 0'].str.extract(pattern)

# Extract Bid Zones and data
bz = df.loc[0, 'Unnamed: 1':].rename(None)
data = df.loc[m, 'Unnamed: 1':].set_axis(bz, axis=1)

cleaned_df = pd.concat([dti, meta, data], axis=1)
out = cleaned_df.melt(['Datetime', 'Horizon', 'Type', 'Day'],
                      var_name='BZ', value_name='Value')

Outputs:

>>> cleaned_df
              Datetime Horizon      Type         Day        FI       NO2        DK1       DK2    SE1    SE2       NO4       NO1       NO3       NO5     SE3     SE4
1  2023-05-03 01:00:00      D2        ND  17-12-2022      7258    2101.2    751.334   567.917  418.5   35.1    1370.9  1254.971  1854.434  1584.931  1396.1  1633.6
2  2023-05-03 01:00:00      D1        ND  16-12-2022  3702.878  1984.168  -1435.167  -130.916    802  316.1  1343.495  1367.602   1838.14  1251.873     981  1474.2
3  2023-05-03 01:00:00      D1        ND  15-12-2022  3702.878  1984.168  -1435.167  -130.916    802  316.1  1343.495  1367.602   1838.14  1251.873     981  1474.2
4  2023-05-03 01:00:00      D1        ND  14-12-2022  3702.878  1984.168  -1435.167  -130.916    802  316.1  1343.495  1367.602   1838.14  1251.873     981  1474.2
5  2023-05-03 01:00:00      D1        ND  13-12-2022  3702.878  1984.168  -1435.167  -130.916    802  316.1  1343.495  1367.602   1838.14  1251.873     981  1474.2
6  2023-05-03 01:00:00      D1        ND  10-12-2022  3702.878  1984.168  -1435.167  -130.916    802  316.1  1343.495  1367.602   1838.14  1251.873     981  1474.2
7  2023-05-03 01:00:00     NaN  Selected  16-12-2022  4885.746  1960.018   -799.833   -76.084    628  -38.1   1356.89  1537.029  1730.735  1412.038  1960.2  1878.3
9  2023-05-03 02:00:00      D2        ND  17-12-2022    7178.9  2167.879    785.333     524.5  452.7  -29.3  1346.132  1151.952  1818.116  1580.202  1144.8  1578.2
10 2023-05-03 02:00:00      D1        ND  16-12-2022   3641.45  1921.937  -1335.417    -90.75  781.6  190.2  1298.576  1265.627   1763.42  1236.619   811.4  1506.3
11 2023-05-03 02:00:00      D1        ND  15-12-2022   3641.45  1921.937  -1335.417    -90.75  781.6  190.2  1298.576  1265.627   1763.42  1236.619   811.4  1506.3
12 2023-05-03 02:00:00      D1        ND  14-12-2022   3641.45  1921.937  -1335.417    -90.75  781.6  190.2  1298.576  1265.627   1763.42  1236.619   811.4  1506.3
13 2023-05-03 02:00:00      D1        ND  13-12-2022   3641.45  1921.937  -1335.417    -90.75  781.6  190.2  1298.576  1265.627   1763.42  1236.619   811.4  1506.3
14 2023-05-03 02:00:00      D1        ND  10-12-2022   3641.45  1921.937  -1335.417    -90.75  781.6  190.2  1298.576  1265.627   1763.42  1236.619   811.4  1506.3
15 2023-05-03 02:00:00     NaN  Selected  16-12-2022  4885.746  1960.018   -799.833   -76.084    628  -38.1   1356.89  1537.029  1730.735  1412.038  1960.2  1878.3
>>> out
               Datetime Horizon      Type         Day   BZ     Value
0   2023-05-03 01:00:00      D2        ND  17-12-2022   FI      7258
1   2023-05-03 01:00:00      D1        ND  16-12-2022   FI  3702.878
2   2023-05-03 01:00:00      D1        ND  15-12-2022   FI  3702.878
3   2023-05-03 01:00:00      D1        ND  14-12-2022   FI  3702.878
4   2023-05-03 01:00:00      D1        ND  13-12-2022   FI  3702.878
..                  ...     ...       ...         ...  ...       ...
163 2023-05-03 02:00:00      D1        ND  15-12-2022  SE4    1506.3
164 2023-05-03 02:00:00      D1        ND  14-12-2022  SE4    1506.3
165 2023-05-03 02:00:00      D1        ND  13-12-2022  SE4    1506.3
166 2023-05-03 02:00:00      D1        ND  10-12-2022  SE4    1506.3
167 2023-05-03 02:00:00     NaN  Selected  16-12-2022  SE4    1878.3

[168 rows x 6 columns]
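Side note on `date = '2023-05-03'` above: if the document date really lives in the workbook's file name, a small regex can recover it. The file name and pattern below are hypothetical; adjust them to your naming scheme:

```python
import re
from pathlib import Path

# Hypothetical file name; adjust the pattern to the real naming scheme
path = Path('elspot_2023-05-03.xlsx')
match = re.search(r'(\d{4}-\d{2}-\d{2})', path.stem)
date = match.group(1)  # feeds the `date` variable used above
```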
Corralien

You can achieve this by first reshaping the data and then using pd.melt() to convert it to long format.

First, create a function to extract the hour from the 'Unnamed: 0' column:

def extract_hour(s):
    if s.startswith('Hour'):
        return int(s.split(' ')[1])
    else:
        return None

Apply this function to the 'Unnamed: 0' column and forward-fill the missing values:

df['Hour'] = df['Unnamed: 0'].apply(extract_hour)
df['Hour'] = df['Hour'].ffill()  # fillna(method='ffill') is deprecated in recent pandas

Then rename the value columns, extract the metadata from the first column, and melt:

BZ = ['FI', 'NO2', 'DK1', 'DK2', 'SE1', 'SE2', 'NO4', 'NO1', 'NO3', 'NO5', 'SE3', 'SE4']

# Rename the value columns to the bid-zone names, then drop the repeated header rows
df = df.rename(columns=dict(zip(df.columns[1:13], BZ)))
df = df[~df['Unnamed: 0'].str.startswith('Hour')]

# type, Horizon and Day all come from the strings in the first column
df['type'] = df['Unnamed: 0'].apply(lambda x: 'Selected' if x.startswith('Selected') else 'ND')
df['Horizon'] = df['Unnamed: 0'].apply(lambda x: None if x.startswith('Selected') else x.split(' ')[0])
df['Day'] = df['Unnamed: 0'].str.extract(r'(\d{2}-\d{2}-\d{4})', expand=False)
df = df.drop(columns=['Unnamed: 0']).reset_index(drop=True)

df_long = pd.melt(df, id_vars=['Hour', 'type', 'Horizon', 'Day'], value_vars=BZ, var_name='BZ', value_name='Value')
df_long['Datetime'] = pd.to_datetime(df_long['Day'], format='%d-%m-%Y') + pd.to_timedelta(df_long['Hour'].astype(int) - 1, unit='h')
df_long = df_long[['Datetime', 'BZ', 'type', 'Horizon', 'Value', 'Day']]

The output should be something like this:

Datetime    BZ      type Horizon     Value         Day
0 2022-12-17 00:00:00    FI        ND      D2    7258.0  17-12-2022
1 2022-12-16 00:00:00    FI        ND      D1  3702.878  16-12-2022
2 2022-12-15 00:00:00    FI        ND      D1  3702.878  15-12-2022
3 2022-12-14 00:00:00    FI        ND      D1  3702.878  14-12-2022
4 2022-12-13 00:00:00    FI        ND      D1  3702.878  13-12-2022
5 2022-12-10 00:00:00    FI        ND      D1  3702.878  10-12-2022
6 2022-12-16 00:00:00    FI  Selected     NaN  4885.746  16-12-2022
7 2022-12-17 01:00:00    FI        ND      D2    7178.9  17-12-2022
8 2022-12-16 01:00:00    FI        ND      D1   3641.45  16-12-2022
9 2022-12-15 01:00:00    FI        ND      D1   3641.45  15-12-2022
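Side note: the trailing ':7' on the 'Selected' rows looks like a selected hour. If you want it folded into the Day timestamp, as the expected output in the question suggests, a sketch along these lines could work (it assumes every 'Selected' cell follows the 'Selected: dd-mm-yyyy:h' shape):

```python
import pandas as pd

# Sample string copied from the question
s = pd.Series(['Selected: 16-12-2022:7'])
parts = s.str.extract(r'Selected:\s*(?P<date>\d{2}-\d{2}-\d{4}):(?P<hour>\d+)')

# Fold the hour suffix into the Day timestamp
day = (pd.to_datetime(parts['date'], format='%d-%m-%Y')
       + pd.to_timedelta(parts['hour'].astype(int), unit='h'))
# day[0] -> Timestamp('2022-12-16 07:00:00')
```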
Matteo Buffagni