1

I have a df1 that looks like this:

year    site    1   2   3   year    site    1   2   3   year    site    1   2   3   year    site    1   2   3
1991    A   4.1 5.9 4.1 1991    B   3.3 4.1 4.1 1991    C   4.1 0.6 4.1 1991    D   4.1 4.1 4.1
1992    A   6.2 5.7 6.2 1992    B   6.2 7.1 6.2 1992    C   6.2 6.2 6.2 1992    D   6.2 9.5 7.4
1993    A   2.6 1.9 4.7 1993    B   2.6 6.2 2.6 1993    C   5.4 8.3 2.6 1993    D   0.4 2.6 2.6

And, I am having trouble transposing the columns - which are months (1,2,3) - to rows for each site so that my reshaped df1 or df2 looks like this:

year    month   Site A  Site B  Site C  Site D
1991    1       4.1     3.3     4.1     4.1
1991    2       5.9     4.1     0.6     4.1
1991    3       4.1     4.1     4.1     4.1
1992    1       6.2     6.2     6.2     6.2
1992    2       5.7     7.1     6.2     9.5
1992    3       6.2     6.2     6.2     7.4
1993    1       2.6     2.6     5.4     0.4
1993    2       1.9     6.2     8.3     2.6
1993    3       4.7     2.6     2.6     2.6

I have tried using 'melt' and 'stack' but I do not understand how to reference the repeating months (1,2,3). Thank you,

user2100039
  • 1,280
  • 2
  • 16
  • 31

3 Answers3

3

Try the following using hard index slicing and reshaping:

#Create input dataframe
np.random.seed(0)
df = pd.concat([pd.DataFrame({'year':[1991, 1992, 1993],
                  'site':[i]*3,
                  1:np.round(np.random.randint(2,8,3)+np.random.random(3),1),
                  2:np.round(np.random.randint(2,8,3)+np.random.random(3),1),
                  3:np.round(np.random.randint(2,8,3)+np.random.random(3),1)}) for i in [*'ABC']], axis=1)

# index slice columns of the dataframe
df_out = pd.concat([df.iloc[:,i:i+5] for i in range(0,df.shape[1],5)])

# Reshape with melt, set_index, and unstack
df_out =  df_out.melt(['year', 'site'], var_name='month')\
      .set_index(['year', 'month', 'site'])['value']\
      .unstack('site').add_prefix('Site ')\
      .reset_index()

print(df_out)

Output:

site  year  month  Site A  Site B  Site C
0     1991      1     6.6     6.0     5.5
1     1991      2     7.3     5.5     7.6
2     1991      3     3.9     2.5     4.7
3     1992      1     7.5     2.1     5.6
4     1992      2     4.1     2.8     7.9
5     1992      3     2.1     3.8     2.1
6     1993      1     2.4     5.9     4.0
7     1993      2     6.3     3.1     2.7
8     1993      3     3.1     3.1     3.7
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
3

We can create a new level of columns by which each column header is grouped positionally with groupby cumcount. The advantage of this is the columns do not need to be in a set order as long as they are identically named.

Then use stack to get all the separate groups into rows, set_index to exclude the site and year columns, then stack and unstack to groups by site instead of month:

# calculate new MultiIndex level
midx = pd.MultiIndex.from_arrays([
    df.columns,
    df.columns.to_series().groupby(level=0).cumcount()
])
new_df = (
    df.set_axis(midx, axis=1)  # replace columns
        .stack()  # Move all groups into rows
        .set_index(['site', 'year'])  # save site and year
        .rename_axis(columns='Month')  # rename column axis to Month
        .stack()  # Move all month columns to rows
        .unstack(level='site')  # Convert to site rows to columns
        .add_prefix('Site ')  # Add Prefix
        .rename_axis(columns=None)  # Remove Axis Name
        .reset_index()  # Restore Range Index
)

new_df:

   year  Month Site A Site B Site C Site D
0  1991      1     A1     B1     C1     D1
1  1991      2     A2     B2     C2     D2
2  1991      3     A3     B3     C3     D3
3  1992      1     A1     B1     C1     D1
4  1992      2     A2     B2     C2     D2
5  1992      3     A3     B3     C3     D3
6  1993      1     A1     B1     C1     D1
7  1993      2     A2     B2     C2     D2
8  1993      3     A3     B3     C3     D3

The more risky approach is to reshape the DataFrame.values based on a set number of unique columns (5) then the rest is the same as above:

unique_cols = df.columns.unique().tolist()
new_df = (
    pd.DataFrame(
        # reshape dataframe into len(unique_cols) columns
        # and however many rows
        df.values.reshape((-1, len(unique_cols))),
        columns=unique_cols  # restore column names
    ).set_index(['year', 'site'])
        .rename_axis(columns='Month')  # rename column axis to Month
        .stack()  # Move all month columns to rows
        .unstack(level='site')  # Convert to site rows to columns
        .add_prefix('Site ')  # Add Prefix
        .rename_axis(columns=None)  # Remove Axis Name
        .reset_index()  # Restore Range Index
)

new_df:

   year  Month Site A Site B Site C Site D
0  1991      1     A1     B1     C1     D1
1  1991      2     A2     B2     C2     D2
2  1991      3     A3     B3     C3     D3
3  1992      1     A1     B1     C1     D1
4  1992      2     A2     B2     C2     D2
5  1992      3     A3     B3     C3     D3
6  1993      1     A1     B1     C1     D1
7  1993      2     A2     B2     C2     D2
8  1993      3     A3     B3     C3     D3

*Note this approach only works if the DataFrame structure can be guaranteed as we're bypassing all pandas data integrity checks by reshaping with numpy.


Setup Used:

from itertools import chain

import pandas as pd

sites = "ABCD"
df = pd.DataFrame(
    chain.from_iterable([range(1991, 1994),
                         [f'{v}'] * 3,
                         [f'{v}1'] * 3,
                         [f'{v}2'] * 3,
                         [f'{v}3'] * 3] for v in sites)
).T
df.columns = ['year', 'site', 1, 2, 3] * len(sites)

Abridged df:

   year site   1   2   3  year site   1  ...   1   2   3  year site   1   2   3
0  1991    A  A1  A2  A3  1991    B  B1  ...  C1  C2  C3  1991    D  D1  D2  D3
1  1992    A  A1  A2  A3  1992    B  B1  ...  C1  C2  C3  1992    D  D1  D2  D3
2  1993    A  A1  A2  A3  1993    B  B1  ...  C1  C2  C3  1993    D  D1  D2  D3
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
0

@HenryEcker's solution is the right and preferred one, especially when the columns are not structured as the one above.

The solution below uses pivot_longer from pyjanitor, and assumes an order to the columns (if you are not sure of the order, @HenryEcker's solution is safe and does the job, with the cumount unique idea):

# using Henry's data
# pip install pyjanitor
import janitor
import pandas as pd

df = df.rename(columns = str)
unique_columns = [*df.columns.unique()

(df.pivot_longer(names_to = unique_columns], 
                 names_pattern = unique_columns)
   .pivot('year', 'site')
   .stack(level = 0)
   .add_prefix('Site')
   .rename_axis(columns = None, 
                index = ['year', 'month'])
   .reset_index()
)
 
   year month SiteA SiteB SiteC SiteD
0  1991     1    A1    B1    C1    D1
1  1991     2    A2    B2    C2    D2
2  1991     3    A3    B3    C3    D3
3  1992     1    A1    B1    C1    D1
4  1992     2    A2    B2    C2    D2
5  1992     3    A3    B3    C3    D3
6  1993     1    A1    B1    C1    D1
7  1993     2    A2    B2    C2    D2
8  1993     3    A3    B3    C3    D3

sammywemmy
  • 27,093
  • 4
  • 17
  • 31