3

Could someone help me to achieve this task? I have data in multi-level data frame through the unstack() operation:

Original df:
Density  Length  Range  Count
  15k    0.60  small    555
  15k    0.60    big     17
  15k    1.80  small    141
  15k    1.80    big     21
  15k    3.60  small    150
  15k    3.60    big     26
  20k    0.60  small   5543
  20k    0.60    big     22
  20k    1.80  small    553
  20k    1.80    big     25
  20k    3.60  small    422
  20k    3.60    big     35

df  = df.set_index(['Density','Length','Range']).unstack('Range')

# After unstack:
                  Count       
Range             big  small
Density Length              
15k     0.60       17    555
        1.80       21    141
        3.60       26    150
20k     0.60       22   5543
        1.80       25    553
        3.60       35    422

Now I try to add an extra column in level 1. it is the ratio of the small/big. I have tried the following syntax, no error but with different outcomes

#df[:]['ratio']=df['Count']['small']/df['Count']['big'] ## case 1. no error, no ratio
#df['Count']['ratio']=df['Count']['small']/df['Count']['big'] ## case 2. no error, no ratio
#df['ratio']=df['Count']['small']/df['Count']['big'] ## case 3. no error, ratio on column level 0
df['ratio']=df.ix[:,1]/df.ix[:,0]                    ## case 4. no error, ratio on column level 0

#After execution above code, df:
                  Count         ratio
Range             big  small       
Density Length                     
15k     0.60       17    555  32.65
        1.80       21    141   6.71
        3.60       26    150   5.77
20k     0.60       22   5543 251.95
        1.80       25    553  22.12
        3.60       35    422  12.06

I don't understand why case 1 & 2 show no error neither adding new ratio column. and why in case 3 & 4 the ratio column is on level 0, not the expected level 1. Also like to know if there is a better/concise way to achieve this. Case 4 is the best I can do but I don't like the implicit indexing way (instead of using the name) to refer to a column.

Thanks

user3287545
  • 1,911
  • 5
  • 20
  • 19

1 Answers1

5

Case 1:

df[:]['ratio']=df['Count']['small']/df['Count']['big'] 

df[:] is a copy of df. They are different objects, each with its own copy of the underlying data:

In [69]: df[:] is df
Out[69]: False

So modifying the copy has no effect on the original df. Since no reference is maintained for df[:], the object is garbage collected after the assignment, making the assignment useless.


Case 2:

df['Count']['ratio']=df['Count']['small']/df['Count']['big'] 

uses chain-indexing. Avoid chain indexing when making assignments. The link explains why assignments using chain-indexing on the left-hand side may not affect df.

If you set

pd.options.mode.chained_assignment = 'warn'

then Pandas will warn you not to use chain-indexing in assignments:

SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Case 3:

df['ratio']=df['Count']['small']/df['Count']['big'] 

and Case 4

df['ratio']=df.ix[:,1]/df.ix[:,0]

both work, but it could done more efficently using

df['ratio'] = df['Count','small']/df['Count','big']

Here is a microbenchmark showing that using df[tuple_index] is faster than chain-indexing:

In [99]: %timeit df['Count']['small']
1000 loops, best of 3: 501 µs per loop

In [128]: %timeit df['Count','small']
100000 loops, best of 3: 8.91 µs per loop

If you want ratio to be the level 1 label, then you must tell Pandas that the level 0 label is Count. You can do that by assigning to df['Count','ratio']:

In [96]: df['Count','ratio'] = df['Count']['small']/df['Count','big']

# In [97]: df
# Out[97]: 
#                Count                  
# Range            big small       ratio
# Density Length                        
# 15k     0.6       17   555   32.647059
#         1.8       21   141    6.714286
#         3.6       26   150    5.769231
# 20k     0.6       22  5543  251.954545
#         1.8       25   553   22.120000
#         3.6       35   422   12.057143
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • thanks unutbu. your answer addresses many of my pandas-ignorance. really appreciate – user3287545 Oct 18 '14 at 18:44
  • Great explanation! Small comment: for the actual solution, can't you just use `df[('Count','ratio')]` instead of `df.loc[:, ('Count','ratio')]` ? – joris Oct 18 '14 at 18:49
  • Joris comment works too. The shortest and most explicit way to calculate the ratio as I gather so far is: `df['Count','ratio']=df['Count','small']/df['Count','big']` Thanks you both – user3287545 Oct 18 '14 at 19:32