We can create a new level of columns with groupby cumcount, in which each column header is grouped positionally. The advantage of this is that the columns do not need to be in a set order as long as they are identically named. Then use stack to move all the separate groups into rows, set_index to move the site and year columns into the index, and stack and unstack to group by site instead of month:
# calculate the new MultiIndex level: number each repeated column name
midx = pd.MultiIndex.from_arrays([
    df.columns,
    df.columns.to_series().groupby(level=0).cumcount()
])

new_df = (
    df.set_axis(midx, axis=1)        # replace columns with the MultiIndex
      .stack()                       # move all positional groups into rows
      .set_index(['site', 'year'])   # move site and year into the index
      .rename_axis(columns='Month')  # rename the column axis to Month
      .stack()                       # move all month columns to rows
      .unstack(level='site')         # convert site rows back to columns
      .add_prefix('Site ')           # add the "Site " prefix
      .rename_axis(columns=None)     # remove the column axis name
      .reset_index()                 # restore a RangeIndex
)
new_df:
year Month Site A Site B Site C Site D
0 1991 1 A1 B1 C1 D1
1 1991 2 A2 B2 C2 D2
2 1991 3 A3 B3 C3 D3
3 1992 1 A1 B1 C1 D1
4 1992 2 A2 B2 C2 D2
5 1992 3 A3 B3 C3 D3
6 1993 1 A1 B1 C1 D1
7 1993 2 A2 B2 C2 D2
8 1993 3 A3 B3 C3 D3
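To make the "grouped positionally" idea concrete, here is a small standalone illustration (the two-block column list below is a reduced version of the layout shown in the setup at the end): cumcount numbers each repeated header by how many times it has already appeared, so identically named columns land in the same positional group.

import pandas as pd

# illustrative only: two repeating blocks of the same five headers
cols = pd.Index(['year', 'site', 1, 2, 3] * 2)
print(cols.to_series().groupby(level=0).cumcount().tolist())
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  <- block number assigned to each column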
The riskier approach is to reshape the underlying DataFrame.values based on a set number of unique columns (5 here); the rest is then the same as above:
unique_cols = df.columns.unique().tolist()
new_df = (
    pd.DataFrame(
        # reshape the values into len(unique_cols) columns
        # and however many rows that requires
        df.values.reshape((-1, len(unique_cols))),
        columns=unique_cols          # restore the column names
    )
    .set_index(['year', 'site'])     # move year and site into the index
    .rename_axis(columns='Month')    # rename the column axis to Month
    .stack()                         # move all month columns to rows
    .unstack(level='site')           # convert site rows back to columns
    .add_prefix('Site ')             # add the "Site " prefix
    .rename_axis(columns=None)       # remove the column axis name
    .reset_index()                   # restore a RangeIndex
)
new_df:
year Month Site A Site B Site C Site D
0 1991 1 A1 B1 C1 D1
1 1991 2 A2 B2 C2 D2
2 1991 3 A3 B3 C3 D3
3 1992 1 A1 B1 C1 D1
4 1992 2 A2 B2 C2 D2
5 1992 3 A3 B3 C3 D3
6 1993 1 A1 B1 C1 D1
7 1993 2 A2 B2 C2 D2
8 1993 3 A3 B3 C3 D3
*Note: this approach only works if the DataFrame structure can be guaranteed, as we're bypassing all of pandas' data-integrity checks by reshaping with numpy.*
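If you do go the reshape route, a cheap guard can catch layouts that would otherwise silently scramble the data. This is only a sketch, under the assumption that the columns repeat in identical, same-order blocks:

# sanity check before reshaping (assumes same-order repeating column blocks)
n = len(unique_cols)
assert df.shape[1] % n == 0, 'column count is not a multiple of the block size'
assert all(
    list(df.columns[i:i + n]) == unique_cols
    for i in range(0, df.shape[1], n)
), 'column blocks are not identically ordered'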
Setup Used:
from itertools import chain
import pandas as pd

sites = "ABCD"
df = pd.DataFrame(
    chain.from_iterable([range(1991, 1994),
                         [f'{v}'] * 3,
                         [f'{v}1'] * 3,
                         [f'{v}2'] * 3,
                         [f'{v}3'] * 3] for v in sites)
).T
df.columns = ['year', 'site', 1, 2, 3] * len(sites)
Abridged df:
year site 1 2 3 year site 1 ... 1 2 3 year site 1 2 3
0 1991 A A1 A2 A3 1991 B B1 ... C1 C2 C3 1991 D D1 D2 D3
1 1992 A A1 A2 A3 1992 B B1 ... C1 C2 C3 1992 D D1 D2 D3
2 1993 A A1 A2 A3 1993 B B1 ... C1 C2 C3 1993 D D1 D2 D3
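As a final sanity check, if you keep both results around (hypothetically as new_df and new_df2, since both snippets above reuse the name new_df), pandas' testing helper can confirm the two approaches agree on this setup:

# hypothetical names: new_df from the MultiIndex approach, new_df2 from the reshape
pd.testing.assert_frame_equal(new_df, new_df2, check_dtype=False)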