2

Given the following data frame:

import numpy as np
import pandas as pd
df = pd.DataFrame({'Site':['a','a','a','b','b','b'],
                   'x':[1,1,0,1,0,0],
                   'y':[1,np.nan,0,1,1,0]
                   })

df

    Site    y   x
0   a      1.0  1
1   a      NaN  1
2   a      0.0  0
3   b      1.0  1
4   b      1.0  0
5   b      0.0  0

I am looking for the most efficient way, for each numerical column (y and x), to produce a percent per group, label the column name, and stack them in one column. Here's how I accomplish this for 'y':

df=df.loc[~np.isnan(df['y'])] #do not count non-numbers
t=pd.pivot_table(df,index='Site',values='y',aggfunc=[np.sum,len])
t['Item']='y'
t['Perc']=round(t['sum']/t['len']*100,1)
t
    sum     len     Item    Perc
Site                
a   1.0     2.0     y       50.0
b   2.0     3.0     y       66.7

Now all I need is a way to add 2 more rows to this; the results for 'x' if I had pivoted with its values above, like this:

    sum     len     Item    Perc
Site                
a   1.0     2.0     y       50.0
b   2.0     3.0     y       66.7
a   1       2       x       50.0
b   1       3       x       33.3

In reality, I have 48 such numerical data columns that need to be stacked as such.

Thanks in advance!

Dance Party
  • 3,459
  • 10
  • 42
  • 67

1 Answers1

1

First you can use notnull. Then omit in pivot_table parameter value, stack and sort_values by new column Item. Last you can use pandas function round:

df=df.loc[df['y'].notnull()]

t=pd.pivot_table(df,index='Site', aggfunc=[sum,len])
    .stack()
    .reset_index(level=1)
    .rename(columns={'level_1':'Item'})
    .sort_values('Item', ascending=False)

t['Perc']= (t['sum']/t['len']*100).round(1)

#reorder columns
t = t[['sum','len','Item','Perc']]
print t
      sum  len Item  Perc
Site                     
a     1.0  2.0    y  50.0
b     2.0  3.0    y  66.7
a     1.0  2.0    x  50.0
b     1.0  3.0    x  33.3

Another solution if is neccessary define values columns in pivot_table:

df=df.loc[df['y'].notnull()]

t=pd.pivot_table(df,index='Site',values=['y', 'x'], aggfunc=[sum,len])
    .stack()
    .reset_index(level=1)
    .rename(columns={'level_1':'Item'})
    .sort_values('Item', ascending=False)

t['Perc']= (t['sum']/t['len']*100).round(1)
#reorder columns
t = t[['sum','len','Item','Perc']]
print t
      sum  len Item  Perc
Site                     
a     1.0  2.0    y  50.0
b     2.0  3.0    y  66.7
a     1.0  2.0    x  50.0
b     1.0  3.0    x  33.3
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • If I have 4 other string columns in my data frame that do not pertain to the pivot table, is there a way to ignore them when creating the pivot table or do I need to drop them from the data frame beforehand? Thanks! – Dance Party May 13 '16 at 11:56
  • Hmmm I think the easiest solution is drop them. – jezrael May 13 '16 at 12:00