1

import pandas as pd

data = [['4/04/2023','Floors','Tables','Roof','Paint','Paint'],['4/05/2023','Roof','Floors','Tables','Roof','Paint'],['4/06/2023','Paint','','Floors','Tables','Roof'],['4/07/2023','Roof','Paint','','Floors','Tables']]

df = pd.DataFrame(data, columns=['Date', 'Jim','Bob','Ed','James','Joe'])

df2 = df.melt(id_vars="Date", value_name='Items').dropna().reset_index(drop=True)

df3 = df2.pivot(index="Date", columns='Items', values='variable') df3 = df3[["Floors","Tables","Roof","Paint"]] #change the order

print(df)

print(df3)

I want this table: enter image description here

To look like this: enter image description here

I would like Pandas to detect duplicates and make an extra column where there are duplicates rather than throwing the following error:

raise ValueError("Index contains duplicate entries, cannot reshape")

ValueError: Index contains duplicate entries, cannot reshape

James C
  • 55
  • 4

3 Answers3

2

You can use:

data = [['4/04/2023','Floors','Tables','Roof','Paint','Paint'],
        ['4/05/2023','Roof','Floors','Tables','Roof','Paint'],
        ['4/06/2023','Paint','','Floors','Tables','Roof'],
        ['4/07/2023','Roof','Paint','','Floors','Tables']]
df = pd.DataFrame(data, columns=['Date', 'Jim','Bob','Ed','James','Joe'])

out = (
 df.replace('', np.nan) # get rid of empty strings
   # reshape to long
   .melt(id_vars="Date", value_name='Items').dropna().reset_index(drop=True)
   # deduplicate
   .assign(n=lambda d: d.groupby(['Date', 'Items']).cumcount().add(1))
   .pivot(index="Date", columns=['Items', 'n'], values='variable')
   # flatten the MultiIndex
   .pipe(lambda d: d.set_axis(d.columns.map(lambda x: f'{x[0]}_{x[1]}'), axis=1))
   # fill NaN with empty strings and reset index
   .fillna('').reset_index()
)

print(out)

Output:

        Date Floors_1 Roof_1 Paint_1 Tables_1 Roof_2 Paint_2
0  4/04/2023      Jim     Ed   James      Bob            Joe
1  4/05/2023      Bob    Jim     Joe       Ed  James        
2  4/06/2023       Ed    Joe     Jim    James               
3  4/07/2023    James    Jim     Bob      Joe               
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Sorry to be obtuse, but where exactly in my code do I put that? – James C Apr 07 '23 at 14:39
  • @JamesC this is the whole code, put it after `df = pd.DataFrame(data, columns=['Date', 'Jim','Bob','Ed','James','Joe'])`. I updated the answer for clarity – mozway Apr 07 '23 at 14:42
0

Code

# for each date convert columns to rows 
s = df.melt('Date').query("value != ''")

# add a supplementary index to identify 
# unique rows per 'date' and 'value'
s['idx'] = s.groupby(['Date', 'value']).cumcount().add(1).astype('str')

# Reshape the dataframe back to wide format
s = s.pivot(index='Date', columns=['value', 'idx'], values='variable')

# Flatten the multiindex columns
s.columns = s.columns.map('_'.join)

Result

          Floors_1 Roof_1 Paint_1 Tables_1 Roof_2 Paint_2
Date                                                     
4/04/2023      Jim     Ed   James      Bob    NaN     Joe
4/05/2023      Bob    Jim     Joe       Ed  James     NaN
4/06/2023       Ed    Joe     Jim    James    NaN     NaN
4/07/2023    James    Jim     Bob      Joe    NaN     NaN
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
0

You can use pivot if you add a suffix to duplicate entries as your output suggests:

add_suffix = lambda x: x['value'] + (-x.groupby(['Date', 'value']).cumcount().add(1)).astype(str).str.replace('-1', '')

out = (df.melt('Date').replace('', np.nan).dropna().assign(value=add_suffix)
         .pivot(index='Date', columns='value', values='variable').fillna('')
         .reset_index().rename_axis(columns=None))

Output:

>>> out
       Date Floors  Paint Paint-2 Roof Roof-2 Tables
0  4/4/2023    Jim  James     Joe   Ed           Bob
1  4/5/2023    Bob    Joe          Jim  James     Ed
2  4/6/2023     Ed    Jim          Joe         James
3  4/7/2023  James    Bob          Jim           Joe
Corralien
  • 109,409
  • 8
  • 28
  • 52