8

This is my DATA in dataframe "df":

Document    Name    Time
SPS2315511  A   1 HOUR
SPS2315512  B   1 - 2 HOUR
SPS2315513  C   2 - 3 HOUR
SPS2315514  C   1 HOUR
SPS2315515  B   1 HOUR
SPS2315516  A   2 - 3 HOUR
SPS2315517  A   1 - 2 HOUR

I am using the below code which gives me the summary of count in the pivot table,

table = pivot_table(df, values=["Document"],
                    index=["Name"], columns=["Time"],
                    aggfunc=lambda x: len(x),
                    margins=True, dropna=True)

but what i want is the % of row calculation as in excel pivot when you right click the pivot and select "show value as -> % of Row Total" . Since my Document is a non-numeric value i was not able to get it.

EXPECTED RESULT :

Count of Document   Column Labels

Name    1 HOUR  1 - 2 HOUR  2 - 3 HOUR  Grand Total
A   33.33%  33.33%  33.33%  100.00%
B   50.00%  50.00%  0.00%   100.00%
C   50.00%  0.00%   50.00%  100.00%
Grand Total 42.86%  28.57%  28.57%  100.00%

Can any one please help me figure out a way to get this result??

i am trying to manipulate the pivot data which will give me the row total,not the data from the dataframe and what i wanted is "% of row total". And also most importantly all my data are non-numeric values...

yoonghm
  • 4,198
  • 1
  • 32
  • 48
keerthi kumar
  • 105
  • 1
  • 2
  • 6
  • 1
    possible duplicate of [How to divide the value of pandas columns by the other column](http://stackoverflow.com/questions/15916612/how-to-divide-the-value-of-pandas-columns-by-the-other-column) – maxymoo Jun 26 '15 at 04:31
  • you can actually just pass `aggfunc=len`, since `len` is already a function :) – maxymoo Jun 26 '15 at 04:31
  • Hi maxymoo in the link you have given they are manipulating one of the column from the dataframe, but my question is different i am trying to manipulate the pivot data which will give me the row total and what i wanted is "% of row total". And also most importantly all my data are non-numeric values... – keerthi kumar Jun 26 '15 at 07:08
  • @maxymoo Thanks for the link, it actually helped me figure out a way to get a solution... I used table = table.div(table.Document["All"], axis='index') – keerthi kumar Jun 26 '15 at 16:15

2 Answers2

14

The possible duplicate noted by @maxymoo is pretty close to a solution, but I'll go ahead and write it up as an answer since there are a couple of differences that are not completely straightforward.

table = pd.pivot_table(df, values=["Document"],
                       index=["Name"], columns=["Time"], 
                       aggfunc=len, margins=True, 
                       dropna=True, fill_value=0)

       Document                      
Time 1 - 2 HOUR 1 HOUR 2 - 3 HOUR All
Name                                 
A             1      1          1   3
B             1      1          0   2
C             0      1          1   2
All           2      3          2   7

The main tweak there is to add fill_value=0 because what you really want there is a count value of zero, not a NaN.

Then you can basically use the solution @maxymoo linked to, but you need to use iloc or similar b/c the table columns are a little complicated now (being a multi-indexed result of the pivot table).

table2 = table.div( table.iloc[:,-1], axis=0 )

       Document                         
Time 1 - 2 HOUR    1 HOUR 2 - 3 HOUR All
Name                                    
A      0.333333  0.333333   0.333333   1
B      0.500000  0.500000   0.000000   1
C      0.000000  0.500000   0.500000   1
All    0.285714  0.428571   0.285714   1

You've still got some minor formatting work to do there (flip first and second columns and convert to %), but those are the numbers you are looking for.

Btw, it's not necessary here, but you might want to think about converting 'Time' to an ordered categorical variable, which would be one way to solve the column ordering problem (I think), but may or may not be worth the bother depending on what else you are doing with the data.

yoonghm
  • 4,198
  • 1
  • 32
  • 48
JohnE
  • 29,156
  • 8
  • 79
  • 109
1

you can use something like this

df = pd.DataFrame({'Document':['SPS2315511','SPS2315512','SPS2315513','SPS2315514','SPS2315515','SPS2315516','SPS2315517'],
                  'Name':['A','B','C','C','B','A','A'],
                  'Time': ['1 HOUR','1 - 2 HOUR','2 - 3 HOUR','1 HOUR','1 HOUR','2 - 3 HOUR','1 - 2 HOUR']})
pd.crosstab(index= df.Name, columns= df.Time, values= df.Document, aggfunc = np.count_nonzero
            ,margins=True,margins_name='Total',normalize= 'index')
sushmit
  • 4,369
  • 2
  • 35
  • 38