Pandas: reshape data frame

Question

I have the following data frame:

url='https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df'

zz=pd.read_csv(url)
zz.head(5)

    date    feccandid   feccandcfscore.dyn  pacid   paccfscore  cid     catcode     type_x  di  amtsum  state   log_diff_unemployment   party   type_y  bills   years_exp   disposition     billsum
0   2006    S8NV00073   0.496   C00000422   0.330   N00006619   H1100   24K     D   5000    NV  -0.024693   Republican  rep     s22-109     12  support     3
1   2006    S8NV00073   0.496   C00375360   0.176   N00006619   H1100   24K     D   4500    NV  -0.024693   Republican  rep     s22-109     12  support     3
2   2006    S8NV00073   0.496   C00113803   0.269   N00006619   H1130   24K     D   2500    NV  -0.024693   Republican  rep     s22-109     12  support     2
3   2006    S8NV00073   0.496   C00249342   0.421   N00006619   H1130   24K     D   5000    NV  -0.024693   Republican  rep     s22-109     12  support     2
4   2006    S8NV00073   0.496   C00255752   0.254   N00006619   H1130   24K     D   4000    NV  -0.024693   Republican  rep     s22-109     12  support     2

I want to manipulate it such that the date column is an index, the feccandid values are the column headers (I will later make them a second index so I can send the frame to panel) and the other column headers become rows. Desired output would look something like this:

date    feccandid              S8NV00072    S8NV00074   S8NV00075   S8NV00076   S8NV00077
2006    feccandcfscore.dyn        0.496        0.496        0.496     0.496       0.496
2006    pacid                  C00000422    C00375360   C00113803   C00249342   C00255752
2006    paccfscore                  0.33        0.176      0.269         0.421    0.254
2006    cid N00006619           N00006619   N00006619   N00006619   N00006619
2006    catcode                  H1100      H1100          H1130    H1130      H1130
2006    type_x                    24K         24K            24K    24K     24K
2006    di                           D          D              D        D       D
2006    amtsum                      5000      4500          2500        5000       4000
2006    state                        NV        NV           NV        NV         NV
2006    log_diff_unemployment   -0.024693   -0.024693   -0.024693   -0.024693   -0.024693
2006    party                     Republican    Republican  Republican  Republican  Republican
2006    type_y                            rep         rep         rep       rep      rep
2006    bills                           s22-109      s22-109    s22-109    s22-109     s22-109
2006    years_exp                             12        12        12       12      12
2006    disposition                      support       support  support support support
2006    billsum                            3               3        2      2       2

I have tried the following as recommended by jezrael

zz=zz.pivot_table(index='date', columns='feccandid', aggfunc=np.mean)

zz.head()

    feccandcfscore.dyn  ...     billsum
feccandid   H0AL02087   H0AL07060   H0AR01083   H0AR02107   H0AR03055   H0AR04038   H0AZ01259   H0AZ03362   H0CA15148   H0CA19173   ...     S8MI00158   S8MN00438   S8MS00055   S8MT00010   S8NC00239   S8NE00117   S8NM00010   S8NV00073   S8OR00207   S8WI00026
date                                                                                    
2005    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2006    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     2.125   NaN     NaN
2007    NaN     0.016   NaN     NaN     NaN     -0.151  NaN     NaN     -0.777  NaN     ...     1.000000    NaN     1.666667    1.552632    NaN     NaN     2.0     1.000   NaN     2.0
2008    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     1.285714    NaN     NaN     5.431373    NaN     NaN     NaN     NaN     NaN     NaN
2009    NaN     NaN     NaN     NaN     NaN     -0.086  NaN     NaN     -0.790  NaN     ...     NaN     NaN     NaN     2.433333    NaN     NaN     NaN     NaN     3.0     2.8

This is something close to what I would like except that I'm trying to get the feccandidas the only column headers and the original column headers (which are--in this last example--as the topmost column headers) to betransposed as rows.

Do you want to pass modified (aggregated by pivot_table) elements to a panel? If so, then you can reorder labels in the pivoted table using: `zz.columns = zz.columns.reorder_levels((1,0))`. Afterwards, you can send it to panel with `zz.T.to_panel()` and then `swapaxis()`. If you'd prefer to keep all the elements untouched, then it's also doable (I can write some code later) but I'm not sure if the size of the resulting panel wouldn't bloat. — ptrj, May 04 '16 at 02:51
@ptrj: I just saw you message. I will get back to you on this later today. I'm going to try and change a few things with the data. — Collective Action, May 04 '16 at 03:55
@ptrj: Thank you for your comment! I don't think that it is necessary to keep the elements untouched. I'm going to work towards your suggestion. — Collective Action, May 04 '16 at 05:44

score 1 · Accepted Answer · edited May 23 '17 at 11:52

1

I think you can use pivot_table (default aggregate function is np.mean):

df = zz.pivot_table(index='date', columns='feccandid', fill_value='0', aggfunc=np.mean)
df.columns = ['_'.join(col) for col in df.columns.values]
print df

If you need replace NaN to 0:

print zz.pivot_table(index='date', columns='feccandid', fill_value='0', aggfunc=np.mean)

EDIT:

I created small sample DataFrame As ptrj says, you can use T and to_panel for creating panel. Then maybe you need transpose:

import pandas as pd

zz = pd.DataFrame({'date': {0: 2001, 1: 2001, 2: 2002, 3: 2002}, 
                   'feccandid': {0: 'S8NV00072', 1: 'S8NV00074', 
                                 2: 'S8NV00072', 3: 'S8NV00074'}, 
                   'pacid': {0: 0.3, 1: 0.1, 2: 0.7, 3: 0.4},
                   'billsum': {0: 1, 1: 2, 2: 5, 3: 6}})

print zz
   billsum  date  feccandid  pacid
0        1  2001  S8NV00072    0.3
1        2  2001  S8NV00074    0.1
2        5  2002  S8NV00072    0.7
3        6  2002  S8NV00074    0.4

zz = zz.pivot_table(index='date', 
                         columns='feccandid',
                         fill_value=0, 
                         aggfunc=np.mean)
print zz.T   
date               2001  2002
        feccandid            
billsum S8NV00072   1.0   5.0
        S8NV00074   2.0   6.0
pacid   S8NV00072   0.3   0.7
        S8NV00074   0.1   0.4

wp = zz.T.to_panel()
print wp
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2001 to 2002
Major_axis axis: billsum to pacid
Minor_axis axis: S8NV00072 to S8NV00074

print wp.transpose(2, 0, 1)

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: S8NV00072 to S8NV00074
Major_axis axis: 2001 to 2002
Minor_axis axis: billsum to pacid

edited May 23 '17 at 11:52

Community

1
1

answered May 03 '16 at 18:59

jezrael

822,522
95
1,334
1,252

That almost works! Unfortunately it puts the `feccanid` values as column headers, but it places them underneath the existing column headers. I'll post the result as an edit. – Collective Action May 03 '16 at 19:07
I just applied you edit but I'm getting the following error: `TypeError: Must pass list-like as names.` – Collective Action May 03 '16 at 19:18
1

No problem. You get multiindex in columns. What is desired output? – jezrael May 03 '16 at 19:20
Desired output: Send it to panel with `feccandid` as `Items axis` individual panels, `date` as `Major_axis`, and all the other columns as `minor_axis`. – Collective Action May 03 '16 at 19:24
I applied your edit, still not there. I think I need to put up a small sample out put. I'm going to start with that now. – Collective Action May 03 '16 at 19:26
1

I try add sample to my answer. Please check it if it is what you want. Btw, this method is [here](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#from-dataframe-using-to-panel-method). – jezrael May 04 '16 at 05:42
Thanks for staying with me on this! I have class pretty soon but as soon as I have a chance I'm going to apply your suggestion! – Collective Action May 04 '16 at 05:49
Super. I think the best is working with small sample of data. I dont use `panel` before, but I try help you if there can be some problem. – jezrael May 04 '16 at 05:53
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/110952/discussion-between-michael-perdue-and-jezrael). – Collective Action May 04 '16 at 05:58

Pandas: reshape data frame

1 Answers1