
I'm trying to reshape this sample dataframe from long to wide format, without aggregating any of the data.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'SubjectID': ['A', 'A', 'A', 'B', 'B', 'C', 'A'],
    'Date': ['2010-03-14', '2010-03-15', '2010-03-16', '2010-03-14',
             '2010-05-15', '2010-03-14', '2010-03-14'],
    'Var1': [1, 12, 4, 7, 90, 1, 9],
    'Var2': [0, 0, 1, 1, 1, 0, 1],
    'Var3': [np.nan, 1, 0, np.nan, 0, 1, np.nan]
})

df['Date'] = pd.to_datetime(df['Date']); df

    Date    SubjectID   Var1    Var2    Var3
0   2010-03-14  A   1   0   NaN
1   2010-03-15  A   12  0   1.0
2   2010-03-16  A   4   1   0.0
3   2010-03-14  B   7   1   NaN
4   2010-05-15  B   90  1   0.0
5   2010-03-14  C   1   0   1.0
6   2010-03-14  A   9   1   NaN

To get around the duplicate values, I group by the "Date" column and take the cumulative count of each row within its date. Then I make a pivot table:

df['idx'] = df.groupby('Date').cumcount()

dfp = df.pivot_table(index = 'SubjectID', columns = 'idx'); dfp 

                 Var1                     Var2                   Var3
idx                 0    1    2    3         0    1    2    3       0    2
SubjectID
A            5.666667  NaN  NaN  9.0  0.333333  NaN  NaN  1.0     0.5  NaN
B           90.000000  7.0  NaN  NaN  1.000000  1.0  NaN  NaN     0.0  NaN
C                 NaN  NaN  1.0  NaN       NaN  NaN  0.0  NaN     NaN  1.0

However, I want the second level of the column index to be the values from the "Date" column rather than idx, and I don't want to aggregate any data (pivot_table aggregates with the mean by default, which is where values like 5.666667 come from). The expected output is

           Var1_2010-03-14  Var1_2010-03-14  Var1_2010-03-15  Var1_2010-03-16  Var1_2010-05-15  Var2_2010-03-14  Var2_2010-03-14  Var2_2010-03-15  Var2_2010-03-16  Var2_2010-05-15  Var3_2010-03-14  Var3_2010-03-14  Var3_2010-03-15  Var3_2010-03-16  Var3_2010-05-15
SubjectID
A          1    9    12   4    NaN  0    1    0    1.0  NaN  NaN  NaN  1.0  0.0  NaN
B          7.0  NaN  NaN  NaN  90   1    NaN  NaN  NaN  1.0  NaN  NaN  NaN  NaN  0.0
C          1    NaN  NaN  NaN  NaN  0    NaN  NaN  NaN  NaN  1.0  NaN  NaN  NaN  NaN

How can I do this? Eventually, I'll merge the two column levels with `dfp.columns = [col[0] + '_' + str(col[1]) for col in dfp.columns]`, as sketched below.
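For context, here is that flattening applied to the dfp above, where the second level is still the integer idx, just to show the shape of the names it produces (with dates in place of idx it should give the "Var1_2010-03-14" style names in the expected output):

# flatten the (variable, idx) column MultiIndex into single string labels
dfp.columns = [col[0] + '_' + str(col[1]) for col in dfp.columns]
dfp.columns.tolist()
# ['Var1_0', 'Var1_1', 'Var1_2', 'Var1_3', 'Var2_0', 'Var2_1', 'Var2_2', 'Var2_3', 'Var3_0', 'Var3_2']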

m13op22

1 Answer


You are on the correct path:

# cumulative count within each Date, to disambiguate the duplicated dates
df['idx'] = df.groupby('Date').cumcount()

# move idx and Date into the index, then unstack both into the columns
new = df.set_index(['idx', 'Date', 'SubjectID']).unstack(level=[0, 1])

# drop the idx level from the column MultiIndex, then flatten the
# remaining (variable, date) pairs into single labels
new.columns = new.columns.droplevel(1)
new.columns = [f'{val}_{date}' for val, date in new.columns]

I think this gives your expected output.
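One small note of my own (not part of the original answer): because Date is still datetime64 here, the f-string renders each label as e.g. 'Var1_2010-03-14 00:00:00'. One way to keep the plain 'Var1_2010-03-14' form is to format the dates as strings before reshaping, for example with dt.strftime:

# format the dates as 'YYYY-MM-DD' strings so the flattened labels
# match the expected output exactly
df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')

df['idx'] = df.groupby('Date').cumcount()
new = df.set_index(['idx', 'Date', 'SubjectID']).unstack(level=[0, 1])
new.columns = new.columns.droplevel(1)
new.columns = [f'{val}_{date}' for val, date in new.columns]
# columns are now 'Var1_2010-03-14', 'Var1_2010-03-15', ...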

Using map looks like it will be a little faster:

df['idx'] = df.groupby('Date').cumcount()
df['Date'] = df['Date'].astype(str)   # dates as strings so the labels can be joined
new = df.set_index(['idx', 'Date', 'SubjectID']).unstack(level=[0, 1])
new.columns = new.columns.droplevel(1)            # drop the idx level
#new.columns = [f'{val}_{date}' for val, date in new.columns]
new.columns = new.columns.map('_'.join)           # flatten with str.join instead

Here is a 50,000 row test example:

#data
data = pd.DataFrame(pd.date_range('2000-01-01', periods=50000, freq='D'))
data['a'] = list('abcd')*12500
data['b'] = 2
data['c'] = list('ABCD')*12500
data.rename(columns={0:'date'}, inplace=True)

# list comprehension:
%%timeit -r 3 -n 200
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = [f'{x}_{y}' for x,y in new.columns]

# 98.2 ms ± 13.3 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)

# map with join:
%%timeit -r 3 -n 200
data['date'] = data['date'].astype(str)
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = new.columns.map('_'.join)

# 84.6 ms ± 3.87 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)
It_is_Chris
  • a little late, but this dataset is a small sample of a larger dataframe I have. Is there a way to speed this up? – m13op22 Jan 17 '19 at 18:00
  • If the dataframe is large then the issue is the last line where you convert the multi index column to a single column with a for loop. When I get back to my computer I will take a look. – It_is_Chris Jan 17 '19 at 20:12
  • thanks for the update. That's a good solution, but my problem is actually using the `unstack` function. I have roughly 50,000 rows but about 3,000 columns. I think it's just a memory issue, so I'm attempting to split the big dataframe into chunks, unstack the chunks, then concatenate the chunks back together. – m13op22 Jan 22 '19 at 19:36
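A rough sketch of the chunked approach described in that last comment (my illustration only, not code from either poster; the helper name unstack_in_chunks and chunk_size are made up, and whether this actually lowers peak memory will depend on the data):

import pandas as pd

def unstack_in_chunks(frame, key_cols, chunk_size=200):
    # reshape a block of value columns at a time, then stitch the pieces
    # back together column-wise; key_cols would be
    # ['idx', 'Date', 'SubjectID'] for the frame in the question
    value_cols = [c for c in frame.columns if c not in key_cols]
    pieces = []
    for start in range(0, len(value_cols), chunk_size):
        block = value_cols[start:start + chunk_size]
        # keep the key columns with every block so each chunk can be unstacked
        piece = (frame[key_cols + block]
                 .set_index(key_cols)
                 .unstack(level=[0, 1]))
        pieces.append(piece)
    out = pd.concat(pieces, axis=1)         # chunks share the SubjectID index
    out.columns = out.columns.droplevel(1)  # drop the idx level
    out.columns = out.columns.map('_'.join)
    return out

# usage, assuming 'idx' has been added and 'Date' converted to str:
# wide = unstack_in_chunks(df, key_cols=['idx', 'Date', 'SubjectID'])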