How to use pd.melt to unpivot a dataframe where columns share a prefix?

Question

I'm trying to unpivot my data using pd.melt, but with no success so far. Each row is a business, and the data contains information about the business and multiple reviews. I want my data to have every review as a row.

My first 150 columns are in groups of 15; the column names in each group share the pattern reviews/n/ for 0 ≤ n ≤ 9 (reviews/0/text, reviews/0/date, ..., reviews/9/date). The next 65 columns in the dataframe contain more data about the business (e.g. business_id, address) and should remain as id variables.

My current data looks like this:

business_id	address	reviews/0/date	reviews/0/text	reviews/1/date	reviews/1/text
12345	01 street	1/1/1990	"abc"	2/2/1995	"def"

and my new dataframe should have every review as a row instead of every business, and look like this:

business_id	address	review_number	review_date	review_text
12345	01 street	0	1/1/1990	"abc"
12345	01 street	1	2/2/1995	"def"

I tried using pd.melt but could not write code that produced a useful result.

buddemat · Accepted Answer · 2023-04-02T00:28:39.273

You can use pandas.wide_to_long() to do what you want.

However, you will need to rename your columns from the pattern reviews/N/COL to reviews/COL/N (or something similar) first, as wide_to_long() can only unpivot based on prefixes, whereas in your column names, you have a prefix and a suffix.

You could do this manually or e.g. using the re module and an appropriate regex:

df = df.rename(columns=lambda x: re.sub(r'reviews/(\d+)/(.*)', r'review_\2\1', x))

After that, your data should look like this (note the changed colnames):

business_id	address	review_date0	review_text0	review_date1	review_text1
12345	01 street	1/1/1990	abc	2/2/1995	def
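To see what the rename does on its own, here is a minimal sketch that applies the same substitution to a plain list of column names. The column reviews/10/date is a hypothetical addition (not in the question's data) to show that a `\d+` group also handles review numbers with more than one digit.

```python
import re

# Column names as in the question, plus one hypothetical two-digit review number.
cols = ['business_id', 'address',
        'reviews/0/date', 'reviews/0/text', 'reviews/10/date']

# Group 1 is the review number, group 2 the field name.
pattern = re.compile(r'reviews/(\d+)/(.*)')

# Move the number to the end so that every column starts with a common stub.
renamed = [pattern.sub(r'review_\2\1', c) for c in cols]
print(renamed)
```

Columns that do not match the pattern (business_id, address) pass through unchanged, which is why the rename can be applied to the whole dataframe.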

Now you can use pandas.wide_to_long() and use the stubnames parameter to specify the prefix of the columns that should be grouped when you unpivot.

df = pd.wide_to_long(df,
                     stubnames=['review_date','review_text'],
                     i=['business_id', 'address'], 
                     j='review_number')

Finally, call .reset_index() to achieve the result you asked for.

Full example:

import re
import pandas as pd

df = pd.DataFrame({'business_id': 12345, 
                   'address': '01 street', 
                   'reviews/0/date': '1/1/1990', 
                   'reviews/0/text': 'abc', 
                   'reviews/1/date': '2/2/1995', 
                   'reviews/1/text': 'def'}, index = [0])

df = df.rename(columns=lambda x: re.sub(r'reviews/(\d+)/(.*)', r'review_\2\1', x))

df = pd.wide_to_long(df,
                     stubnames=['review_date','review_text'],
                     i=['business_id', 'address'], 
                     j='review_number').reset_index()

Result:

business_id	address	review_number	review_date	review_text
12345	01 street	0	1/1/1990	abc
12345	01 street	1	2/2/1995	def
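With 65 id columns, as in the question, typing the `i` and `stubnames` lists by hand is tedious. The following is a sketch of one way to derive both from the column names. It assumes that every non-review column is an id column and that the id columns together identify each row uniquely (wide_to_long raises an error otherwise). The names `pattern`, `id_cols`, `stubs` and `long_df` are made up for this illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({'business_id': [12345],
                   'address': ['01 street'],
                   'reviews/0/date': ['1/1/1990'],
                   'reviews/0/text': ['abc'],
                   'reviews/1/date': ['2/2/1995'],
                   'reviews/1/text': ['def']})

pattern = re.compile(r'reviews/(\d+)/(.*)')

# Everything that is not a review column is treated as an id column.
id_cols = [c for c in df.columns if not pattern.fullmatch(c)]

# One stub per distinct review field, e.g. 'review_date', 'review_text'.
stubs = sorted({'review_' + m.group(2)
                for m in map(pattern.fullmatch, df.columns) if m})

renamed = df.rename(columns=lambda c: pattern.sub(r'review_\2\1', c))

long_df = pd.wide_to_long(renamed,
                          stubnames=stubs,
                          i=id_cols,
                          j='review_number').reset_index()
print(long_df)
```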

score 2 · Answer 2 · answered Apr 02 '23 at 01:56

You can get the names of all the non-review columns.

columns = df.columns[~df.columns.str.match(r'reviews/\d+/')]

>>> columns
Index(['address', 'business_id'], dtype='object')

And use those to .melt()

df = df.melt(columns)

df['review_number'] = df['variable'].str.extract(r'reviews/(\d+)')
df['variable'] = df['variable'].str.replace(r'reviews/\d+/', 'review_', regex=True)

>>> df
  address  business_id     variable                value review_number
0  street            1  review_date  1990-01-01 00:00:00             0
1  street            1  review_text                "abc"             0
2  street            1  review_date  1995-02-02 00:00:00             1
3  street            1  review_text                "def"             1

From there you can .pivot()

>>> df.pivot(index=columns.union(['review_number']).to_list(), columns='variable')
                                                 value            
variable                                   review_date review_text
address business_id review_number                                 
street  1           0              1990-01-01 00:00:00       "abc"
                    1              1995-02-02 00:00:00       "def"
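The pivoted frame above still has its id columns in the index and a two-level column header. As a sketch, the steps can be combined and flattened into the layout asked for in the question. This uses the question's sample data rather than the frame shown above, passes values='value' so the columns stay single-level, and assumes each combination of id columns and review number occurs only once (pivot raises an error on duplicates). The names `id_cols`, `long_df` and `tidy` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'business_id': [12345],
                   'address': ['01 street'],
                   'reviews/0/date': ['1/1/1990'],
                   'reviews/0/text': ['abc'],
                   'reviews/1/date': ['2/2/1995'],
                   'reviews/1/text': ['def']})

# All columns that are not review columns are kept as id variables.
id_cols = df.columns[~df.columns.str.match(r'reviews/\d+/')].tolist()

long_df = df.melt(id_vars=id_cols)

# Split 'reviews/<n>/<field>' into the review number and a 'review_<field>' label.
long_df['review_number'] = (long_df['variable']
                            .str.extract(r'reviews/(\d+)/', expand=False)
                            .astype(int))
long_df['variable'] = long_df['variable'].str.replace(
    r'reviews/\d+/', 'review_', regex=True)

# Spread the field labels back into columns, one row per review.
tidy = long_df.pivot(index=id_cols + ['review_number'],
                     columns='variable',
                     values='value').reset_index()
tidy.columns.name = None  # drop the leftover 'variable' label on the column axis
print(tidy)
```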

sammywemmy · Answer 3 · 2023-04-02T02:44:40.393

One option is pivot_longer from pyjanitor. In this case we use the special placeholder .value to identify the parts of the column name that should remain as headers, while the rest are collated into a new column:

# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
    index = ['business_id', 'address'], 
    names_to = ['.value', 'reviewnumber', '.value'], 
    names_pattern = r"(review)s/(\d+)/(.+)"
 )
.rename(columns = lambda f: f.replace('review', 'review_'))
) 
   business_id    address review_number review_date review_text
0        12345  01 street             0    1/1/1990         abc
1        12345  01 street             1    2/2/1995         def

A regex gives the flexibility to extract the labels into separate columns. Note that you can use .value as many times as you want, as long as you get the regex right.

Another option is DataFrame.stack, where the columns are split before transforming. As a general rule, if you can, split the columns before reshaping rather than after (the larger the data, the more this helps performance):

temp = df.set_index(['business_id', 'address'])
temp.columns = temp.columns.str.split("/", expand=True)
temp.columns.names = [None, 'review_numbers', None]
# quick route - the collapse_levels function 
# is from pyjanitor
# temp.stack('review_numbers').collapse_levels().reset_index()
temp = temp.stack('review_numbers')
temp.columns = temp.columns.map("_".join)
temp.reset_index()
   business_id    address review_numbers reviews_date reviews_text
0        12345  01 street              0     1/1/1990          abc
1        12345  01 street              1     2/2/1995          def
