Pandas wide_to_long with list of stubs

Question

I currently have the following dataframe:

        1_1       1_2       1_3       1_4       1_5       2_1  ...       9_5      10_1      10_2      10_3      10_4      10_5

0  0.049400  0.063812  0.097736 -0.077222  0.112779 -0.201620  ...  0.138246  0.015369 -0.083559 -0.186949  0.158505 -0.046787 
1 -0.169837  0.093606  0.043157  0.095289 -0.078525 -0.026500  ... -0.054344  0.008955  0.045036  0.198438  0.197416 -0.057831 
2 -0.192915  0.001477  0.077699  …

I want to obtain something like this:

cat     u       i       mouse       

0       1       1      0.049400 
1       1       1     -0.169837
2       1       1     -0.192915
0       1       2      0.063812
1       1       2      0.093606
2       1       2      0.001477
…

Essentially, the lines represent the value of the cat column, the number before the underscore represents the u column and the number after represents the i column. Lastly, the mouse column is the value for the combination of the previous factors.

However, the solution should work for any data in such format.

So far, I have this, considering I have access to the list of u (in this case 1,2,3,4,5,6,7,8,9,10) and i (1,2,3,4,5), but the solution should work for different lists and different numbers of lines as well.

u_seq_stub = [u + '_' for u in u_seq] 
df = pd.wide_to_long(df, u_seq_stub, i='u', j='i').reset_index().rename(columns={'_':'u'})

This doesn't work however, and throws "KeyError: "None of [Index(['userid'], dtype='object')] are in the [columns]""... I've also consulted this and it doesn't seem that different from what I want but there must be something I am misunderstanding.

I appreciate in advance any help.

jezrael · Accepted Answer · 2019-11-29T16:01:42.833

Use split by all columns for MultiIndex and then reshape by DataFrame.unstack, change levels for new columns names by DataFrame.rename_axis and last convert it to columns by Series.reset_index:

df.columns = df.columns.str.split('_', expand=True)
df = df.unstack().rename_axis(('u','i','cat')).reset_index(name='mouse')
print (df.head(10))
   u  i  cat     mouse
0  1  1    0  0.049400
1  1  1    1 -0.169837
2  1  1    2 -0.192915
3  1  2    0  0.063812
4  1  2    1  0.093606
5  1  2    2  0.001477
6  1  3    0  0.097736
7  1  3    1  0.043157
8  1  3    2  0.077699
9  1  4    0 -0.077222

Your solution should be changed first with parameter sep in wide_to_long, then another reshape by DataFrame.stack with some data cleaning:

u_seq_stub = ['1','2',...,'9','10']
#alternative
#u_seq_stub = [str(x) for x in range(1,11)]


df = (pd.wide_to_long(df.reset_index(), 
                     u_seq_stub, 
                     i='index', 
                     j='i', 
                     sep='_')
      .stack()
      .reset_index(name='mouse')
      .rename(columns={'index':'cat', 'level_2':'u'})
      .astype({'i':int, 'u':int})
      .sort_values(['u','i','cat'])
      )
print (df.head(10))
    cat  i  u     mouse
0     0  1  1  0.049400
12    1  1  1 -0.169837
24    2  1  1 -0.192915
3     0  2  1  0.063812
15    1  2  1  0.093606
25    2  2  1  0.001477
5     0  3  1  0.097736
17    1  3  1  0.043157
26    2  3  1  0.077699
7     0  4  1 -0.077222

wide_to_long is the right way for this question :-) BTW happy holiday — BENY, Nov 29 '19 at 15:40

ansev · Answer 2 · 2019-11-29T15:41:41.147

1

Use DataFrame.melt + Series.str.split with expand=True. To create cat column you could use GroupBy.cumcount:

new_df=df.melt(var_name='u_i',value_name='mouse')
new_df['cat']=new_df.groupby('u_i').cumcount()
new_df[['u','i']]=new_df['u_i'].str.split('_',expand=True)
new_df=new_df.drop('u_i',axis=1).reindex(columns=['cat','u','i','mouse'])

print(new_df)

   cat  u  i     mouse
0    0  1  1  0.049400
1    1  1  1 -0.169837
2    2  1  1 -0.192915
3    0  1  2  0.063812
4    1  1  2  0.093606
5    2  1  2  0.001477
6    0  1  3  0.097736
7    1  1  3  0.043157
8    2  1  3  0.077699

edited Nov 29 '19 at 15:41

answered Nov 29 '19 at 15:30

ansev

30,322
5
17
31

1

Thanks for your reply! Is there a way to still include the value of the cat column though? Meaning in the final result there would be 4 columns total? Other than that this is the perfect solution – sig5 Nov 29 '19 at 15:39
yes, you could use `groupby.cumcount` . I have updated my solution! – ansev Nov 29 '19 at 15:42
2

it is index names I think – jezrael Nov 29 '19 at 15:47

score 0 · Answer 3 · edited May 28 '22 at 00:00

One option is with pivot_longer from pyjanitor, using the .value placeholder:

# pip install pyjanitor
import pandas as pd
import janitor 

result = (df
         .add_suffix('_mouse')
         .pivot_longer(
             names_to = ('u', 'i', '.value'), 
             names_sep = '_', 
             names_transform = int,
             ignore_index = False)
         .rename_axis(index = 'cat')
         )

result.head(10)

     u  i     mouse
cat
0    1  1  0.049400
1    1  1 -0.169837
2    1  1 -0.192915
0    1  2  0.063812
1    1  2  0.093606
2    1  2  0.001477
0    1  3  0.097736
1    1  3  0.043157
2    1  3  0.077699
0    1  4 -0.077222

The .value placeholder determines which parts of the column names stays as headers, while the rest are cumulated into u and I columns; the names_transform parameter offers an efficient way to apply astype (which is what is used under the hood) to the u and I columns.

Pandas wide_to_long with list of stubs

3 Answers3

Related