1

I'm trying to convert a dataframe in Python, and I am stuck because I don't know exactly how to phrase what I want to do (which makes searching for it difficult). It seems like I need a combination of unstack and pivot. I can explain it with an example though, hopefully. I have a dataframe with this shape:

userid GroupA_measure1 GroupA_measure2 GroupB_measure1 GroupB_measure2
001 65 70 45 50
002 96 89 12 8
003 12 14 38 40

and I want to convert this to this format:

userid measure groupA groupB
001 1 65 45
001 2 70 50
002 1 96 12
002 2 89 8
003 1 12 38
003 2 14 40

I can unstack the whole df using pd.melt(df, id_vars =['userid']) which put all values in separate rows, but I want to keep separate columns for the values of GroupA and GroupB.

Any help would be much appreciated.

KJ_
  • 336
  • 1
  • 3
  • 11

2 Answers2

1

Use wide_to_long with extract numbers from measure column by Series.str.extract:

df1 = pd.wide_to_long(df, 
                      stubnames=['GroupA','GroupB'], 
                      i='userid', 
                      j='measure', sep='_', suffix=r'\w+').reset_index()

df1['measure'] = df1['measure'].str.extract('(\d+)').astype(int)

Or convert not _ columns first, split all columns with _ and reshape by DataFrame.stack, last also extract digits:

df1 = df.set_index('userid')
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.rename_axis((None, 'measure'), axis=1).stack().reset_index()
df1['measure'] = df1['measure'].str.extract('(\d+)').astype(int)
print (df1)
  userid  measure GroupA GroupB
0    001        1     65     45
1    002        1     96     12
2    003        1     12     38
3    001        2     70     50
4    002        2     89      8
5    003        2     14     40

Last if necessary sorting by DataFrame.sort_values:

df1 = df1.sort_values('userid', ignore_index=True)
print (df1)
  userid  measure GroupA GroupB
0    001        1     65     45
1    001        2     70     50
2    002        1     96     12
3    002        2     89      8
4    003        1     12     38
5    003        2     14     40
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

One option is with the pivot_longer function from pyjanitor, using the .value placeholder:

# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(index="userid", 
                names_to=(".value", "measure"), 
                names_pattern=r"(.+)_*(\d)"
               )

   userid measure  GroupA_measure  GroupB_measure
0     001       1              65              45
1     002       1              96              12
2     003       1              12              38
3     001       2              70              50
4     002       2              89               8
5     003       2              14              40

The names_pattern is a regex to split the columns. The .value keeps the part of the column associated with it as header, while the number extracted via the regex goes into the measure column.

sammywemmy
  • 27,093
  • 4
  • 17
  • 31