Using pd.wide_to_long and having the sep= be parenthesis, getting empty dataframe with the column names only

Question

I am trying to do pd.wide_to_long and I want to separate based off of what is inside a parenthesis. I looked up the pandas manual here and the closes example is below. I changed the column names to match something to what I have.

np.random.seed(0)
df = pd.DataFrame({'A(weekly_2010)': np.random.rand(3),
                   'A(weekly_2011)': np.random.rand(3),
                   'B(weekly_2010)': np.random.rand(3),
                   'B(weekly_2011)': np.random.rand(3),
                   'X' : np.random.randint(3, size=3)})
df['id'] = df.index

I would do

df1=pd.wide_to_long(df, stubnames=['A', 'B'], i=['id'], j='year', sep='()', suffix='\w+')

This returns an empty dataframe. I looked at this example and switched the i and J but it returns a KeyError: "None of [Index(['year'], dtype='object')] are in the [columns]" I have tried the following for sep= and still no luck. Does someone have an idea of how to get the following dataframe.

df1=pd.wide_to_long(df, stubnames=['A', 'B'], i=['id'], j='year', sep='(*)', suffix='\w+')
df1=pd.wide_to_long(df, stubnames=['A', 'B'], i=['id'], j='year', sep='\\()', suffix='\w+')
df1=pd.wide_to_long(df, stubnames=['A', 'B'], i=['id'], j='year', sep=r'\\()', suffix='\w+')
df1=pd.wide_to_long(df11, stubnames=['A', 'B'], i=['id'], j='year', sep=r'\(*\)', suffix='\w+')

The desired dataframe would be

desiredDF= pd.DataFrame({'year':["(weekly_2010)","(weekly_2010)","(weekly_2010)","(weekly_2011)","(weekly_2011)","(weekly_2011)"],
               'A': np.random.rand(6),
               'B': np.random.rand(6),
               'X' : np.random.randint(6, size=6),
               'id':np.random.randint(6, size=6)})

obviously with the same numbers, I just wanted to show the column that matters which is the year column. I need to do pivot because my dataframe is more complicated and doing other formats, messes up the numbers. If anyone knows how to write in the sep, I would greatly appreciate it!

score 1 · Accepted Answer · answered Nov 03 '22 at 21:35

Try this, remove the trailing close parenthesis and separate on open parenthesis:

df.columns = df.columns.str.replace(')','', regex=False)
pd.wide_to_long(df, stubnames=['A', 'B'], i=['id'], j='year', sep='(', suffix='\w+')

Output:

                X         A         B
id year                              
0  weekly_2010  0  0.548814  0.437587
1  weekly_2010  1  0.715189  0.891773
2  weekly_2010  1  0.602763  0.963663
0  weekly_2011  0  0.544883  0.383442
1  weekly_2011  1  0.423655  0.791725
2  weekly_2011  1  0.645894  0.528895

I find that sometimes you need to do a little prep work to get pd.wide_to_long to work properly.

score 1 · Answer 2 · answered Nov 03 '22 at 22:01

@ScottBoston's solution works well. I'd suggest an alternative to pd.wide_to_long that offers more flexibility, while still being performant - pivot_longer from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(index=['X', 'id'], 
                names_to = ('.value', 'year'), 
                names_pattern=r"(.)(.+)") 
   X  id           year         A         B
0  0   0  (weekly_2010)  0.548814  0.437587
1  1   1  (weekly_2010)  0.715189  0.891773
2  1   2  (weekly_2010)  0.602763  0.963663
3  0   0  (weekly_2011)  0.544883  0.383442
4  1   1  (weekly_2011)  0.423655  0.791725
5  1   2  (weekly_2011)  0.645894  0.528895

The .value in this case tells the function to keep that part of the column as a header. This is determined by the argument passed to names_pattern or names_sep.

Sticking to pd.wide_to_long, simply use the generic .+ regex for the suffix parameter:

pd.wide_to_long(df, 
                stubnames = ['A','B'], 
                i = 'id', 
                j = 'year', 
                sep='', 
                suffix='.+') 
                  X         A         B
id year                                
0  (weekly_2010)  0  0.548814  0.437587
1  (weekly_2010)  1  0.715189  0.891773
2  (weekly_2010)  1  0.602763  0.963663
0  (weekly_2011)  0  0.544883  0.383442
1  (weekly_2011)  1  0.423655  0.791725
2  (weekly_2011)  1  0.645894  0.528895

Using pd.wide_to_long and having the sep= be parenthesis, getting empty dataframe with the column names only

2 Answers2