
I'm trying to reshape this sample dataframe from long to wide format, without aggregating any of the data.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'SubjectID': ['A', 'A', 'A', 'B', 'B', 'C', 'A'],
    'Date': ['2010-03-14', '2010-03-15', '2010-03-16', '2010-03-14',
             '2010-05-15', '2010-03-14', '2010-03-14'],
    'Var1': [1, 12, 4, 7, 90, 1, 9],
    'Var2': [0, 0, 1, 1, 1, 0, 1],
    'Var3': [np.nan, 1, 0, np.nan, 0, 1, np.nan]
})

df['Date'] = pd.to_datetime(df['Date']); df

    Date    SubjectID   Var1    Var2    Var3
0   2010-03-14  A   1   0   NaN
1   2010-03-15  A   12  0   1.0
2   2010-03-16  A   4   1   0.0
3   2010-03-14  B   7   1   NaN
4   2010-05-15  B   90  1   0.0
5   2010-03-14  C   1   0   1.0
6   2010-03-14  A   9   1   NaN

To get around the duplicate values, I group by the "Date" column and take the cumulative count of each row within its date. Then I make a pivot table:

df['idx'] = df.groupby('Date').cumcount()

dfp = df.pivot_table(index = 'SubjectID', columns = 'idx'); dfp 

                 Var1                     Var2                   Var3
idx                 0    1    2    3         0    1    2    3       0    2
SubjectID
A            5.666667  NaN  NaN  9.0  0.333333  NaN  NaN  1.0     0.5  NaN
B           90.000000  7.0  NaN  NaN  1.000000  1.0  NaN  NaN     0.0  NaN
C                 NaN  NaN  1.0  NaN       NaN  NaN  0.0  NaN     NaN  1.0

However, I want the second level of the column index to be the values from the "Date" column rather than idx, and I don't want to aggregate any data (pivot_table aggregates with the mean by default, which is where values like 5.666667 come from). The expected output is

           Var1_2010-03-14  Var1_2010-03-14  Var1_2010-03-15  Var1_2010-03-16  Var1_2010-05-15  Var2_2010-03-14  Var2_2010-03-14  Var2_2010-03-15  Var2_2010-03-16  Var2_2010-05-15  Var3_2010-03-14  Var3_2010-03-14  Var3_2010-03-15  Var3_2010-03-16  Var3_2010-05-15
SubjectID
A          1    9    12   4    NaN  0    1    0    1.0  NaN  NaN  NaN  1.0  0.0  NaN
B          7.0  NaN  NaN  NaN  90   1    NaN  NaN  NaN  1.0  NaN  NaN  NaN  NaN  0.0
C          1    NaN  NaN  NaN  NaN  0    NaN  NaN  NaN  NaN  1.0  NaN  NaN  NaN  NaN

How can I do this? Eventually, I'll merge the two column levels with `dfp.columns = [col[0] + '_' + str(col[1]) for col in dfp.columns]`, as sketched below.
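For context, here is that flattening applied to the dfp above, where the second level is still the integer idx, just to show the shape of the names it produces (with dates in place of idx it should give the "Var1_2010-03-14" style names in the expected output):

# flatten the (variable, idx) column MultiIndex into single string labels
dfp.columns = [col[0] + '_' + str(col[1]) for col in dfp.columns]
dfp.columns.tolist()
# ['Var1_0', 'Var1_1', 'Var1_2', 'Var1_3', 'Var2_0', 'Var2_1', 'Var2_2', 'Var2_3', 'Var3_0', 'Var3_2']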

m13op22

1 Answer


You are on the correct path:

# cumulative count within each Date, to disambiguate the duplicated dates
df['idx'] = df.groupby('Date').cumcount()

# move idx and Date into the index, then unstack both into the columns
new = df.set_index(['idx', 'Date', 'SubjectID']).unstack(level=[0, 1])

# drop the idx level from the column MultiIndex, then flatten the
# remaining (variable, date) pairs into single labels
new.columns = new.columns.droplevel(1)
new.columns = [f'{val}_{date}' for val, date in new.columns]

I think this gives your expected output.
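One small note of my own (not part of the original answer): because Date is still datetime64 here, the f-string renders each label as e.g. 'Var1_2010-03-14 00:00:00'. One way to keep the plain 'Var1_2010-03-14' form is to format the dates as strings before reshaping, for example with dt.strftime:

# format the dates as 'YYYY-MM-DD' strings so the flattened labels
# match the expected output exactly
df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')

df['idx'] = df.groupby('Date').cumcount()
new = df.set_index(['idx', 'Date', 'SubjectID']).unstack(level=[0, 1])
new.columns = new.columns.droplevel(1)
new.columns = [f'{val}_{date}' for val, date in new.columns]
# columns are now 'Var1_2010-03-14', 'Var1_2010-03-15', ...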

Using map looks like it will be a little faster:

df['idx'] = df.groupby('Date').cumcount()
df['Date'] = df['Date'].astype(str)   # dates as strings so the labels can be joined
new = df.set_index(['idx', 'Date', 'SubjectID']).unstack(level=[0, 1])
new.columns = new.columns.droplevel(1)            # drop the idx level
#new.columns = [f'{val}_{date}' for val, date in new.columns]
new.columns = new.columns.map('_'.join)           # flatten with str.join instead

Here is a 50,000 row test example:

#data
data = pd.DataFrame(pd.date_range('2000-01-01', periods=50000, freq='D'))
data['a'] = list('abcd')*12500
data['b'] = 2
data['c'] = list('ABCD')*12500
data.rename(columns={0:'date'}, inplace=True)

# list comprehension:
%%timeit -r 3 -n 200
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = [f'{x}_{y}' for x,y in new.columns]

# 98.2 ms ± 13.3 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)

# map with join:
%%timeit -r 3 -n 200
data['date'] = data['date'].astype(str)
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = new.columns.map('_'.join)

# 84.6 ms ± 3.87 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)
It_is_Chris
  • a little late, but this dataset is a small sample of a larger dataframe I have. Is there a way to speed this up? – m13op22 Jan 17 '19 at 18:00
  • If the dataframe is large then the issue is the last line where you convert the multi index column to a single column with a for loop. When I get back to my computer I will take a look. – It_is_Chris Jan 17 '19 at 20:12
  • thanks for the update. That's a good solution, but my problem is actually using the `unstack` function. I have roughly 50,000 rows but about 3,000 columns. I think it's just a memory issue, so I'm attempting to split the big dataframe into chunks, unstack the chunks, then concatenate the chunks back together. – m13op22 Jan 22 '19 at 19:36
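A rough sketch of the chunked approach described in that last comment (my illustration only, not code from either poster; the helper name unstack_in_chunks and chunk_size are made up, and whether this actually lowers peak memory will depend on the data):

import pandas as pd

def unstack_in_chunks(frame, key_cols, chunk_size=200):
    # reshape a block of value columns at a time, then stitch the pieces
    # back together column-wise; key_cols would be
    # ['idx', 'Date', 'SubjectID'] for the frame in the question
    value_cols = [c for c in frame.columns if c not in key_cols]
    pieces = []
    for start in range(0, len(value_cols), chunk_size):
        block = value_cols[start:start + chunk_size]
        # keep the key columns with every block so each chunk can be unstacked
        piece = (frame[key_cols + block]
                 .set_index(key_cols)
                 .unstack(level=[0, 1]))
        pieces.append(piece)
    out = pd.concat(pieces, axis=1)         # chunks share the SubjectID index
    out.columns = out.columns.droplevel(1)  # drop the idx level
    out.columns = out.columns.map('_'.join)
    return out

# usage, assuming 'idx' has been added and 'Date' converted to str:
# wide = unstack_in_chunks(df, key_cols=['idx', 'Date', 'SubjectID'])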