4

I have a pandas dataframe with MultiIndex columns, with 3 levels:

import itertools
import numpy as np

def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]


miindex = pd.MultiIndex.from_product([mklbl('A', 4)])

micolumns = pd.MultiIndex.from_tuples(list(itertools.product(['A', 'B'], ['a', 'b', 'c'], ['foo', 'bar'])),
                                      names=['lvl0', 'lvl1', 'lvl2'])

dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns)).reshape((len(miindex), len(micolumns))),
                    index=miindex,
                    columns=micolumns).sort_index().sort_index(axis=1)

lvl0   A                       B                    
lvl1   a       b       c       a       b       c    
lvl2 bar foo bar foo bar foo bar foo bar foo bar foo
A0     1   0   3   2   5   4   7   6   9   8  11  10
A1    13  12  15  14  17  16  19  18  21  20  23  22
A2    25  24  27  26  29  28  31  30  33  32  35  34
A3    37  36  39  38  41  40  43  42  45  44  47  46

I want to mask this dataframe, based on another dataframe, which has the last two levels of the index:

cols = micolumns.droplevel(0).unique()
a_mask = pd.DataFrame(np.random.randn(len(dfmi.index), len(cols)), index=dfmi.index, columns=cols)
a_mask = (np.sign(a_mask) > 0).astype(bool)

        a             b             c       
      foo    bar    foo    bar    foo    bar
A0  False  False  False   True   True  False
A1   True  False   True  False   True   True
A2   True   True   True   True  False  False
A3   True  False  False   True   True  False

What I would like to do is to mask the original dataframe according to a_mask. Let's say I want to set the original entries to zero, when a_mask is true.

I tried to use pd.IndexSlice, but it fails silently (i.e. I can run the following code, but has no effect:

dfmi.loc[:, pd.IndexSlice[:, a_mask]] = 0  #dfmi is unchanged

Any suggestion how to achieve this?

Edit In my use case, the labels are constructed with a cartesian product, so there will be all combinations of (lev0, lev1, lev2). But it is the case that lev0 can assume 2 values {A, B}, while lev1 can assume 3 values {a, b, c}

FLab
  • 7,136
  • 5
  • 36
  • 69

4 Answers4

3

I think using this way is more safe.

dfmi.where(a_mask.loc[:,dfmi.columns.droplevel(0)].values,0)
Out[191]: 
lvl0   A               B            
lvl1   a       b       a       b    
lvl2 bar foo bar foo bar foo bar foo
A0     0   0   0   2   0   0   0   6
A1     9   8  11   0  13  12  15   0
A2     0  16  19  18   0  20  23  22
A3    25   0   0   0  29   0   0   0
BENY
  • 317,841
  • 20
  • 164
  • 234
2

I would do it as follows:

mask = pd.concat({k: a_mask for k in dfmi.columns.levels[0]}, axis=1)
dfmi.where(~mask, 0)
JoeCondron
  • 8,546
  • 3
  • 27
  • 28
  • Thanks, this solution works, although I would wait for other suggestions before accepting. I would suggest the following improvements, to make it slightly more generic. mask = pd.concat(dict.fromkeys(dfmi.columns.get_level_values(0), a_mask), axis=1) dfmi = dfmi.mask(mask, 0) – FLab Nov 27 '17 at 18:44
  • This is a smart solution but, in your updated question, it's still the case that the labels are simply "repeats" (a, b, c / a, b, c) ... if this were ever not the case, the concatenation solution wouldn't work. And I still think `.loc` should be able to be employed here somehow. – Brad Solomon Nov 27 '17 at 18:51
  • For example, if you were to delete `dfmi[('A', 'a', 'foo')]`, you have "uneven" data – Brad Solomon Nov 27 '17 at 18:54
  • clarified in the answer – FLab Nov 27 '17 at 18:59
  • @BradSolomon adding the .loc method . :-) – BENY Nov 27 '17 at 19:12
  • If the data was uneven the `where` method would reindex `mask` as per `dfmi`so should handle that.You could give more control by doing `mask = mask.reindex_like(dfmi, fill_value=)` – JoeCondron Nov 28 '17 at 09:19
1

Working with the underlying array data for in-situ edit for memory efficiency (doesn't create any other dataframe) -

d = len(dfmi.columns.levels[0])
n = dfmi.shape[1]//d
for i in range(0,d*n,n):
    dfmi.values[:,i:i+n][a_mask] = 0

Sample run -

In [833]: dfmi
Out[833]: 
lvl0   A                       B                    
lvl1   a       b       c       a       b       c    
lvl2 bar foo bar foo bar foo bar foo bar foo bar foo
A0     1   0   3   2   5   4   7   6   9   8  11  10
A1    13  12  15  14  17  16  19  18  21  20  23  22
A2    25  24  27  26  29  28  31  30  33  32  35  34
A3    37  36  39  38  41  40  43  42  45  44  47  46

In [834]: a_mask
Out[834]: 
        a             b             c       
      foo    bar    foo    bar    foo    bar
A0   True   True   True  False  False  False
A1  False   True  False  False   True  False
A2  False   True   True   True  False  False
A3  False  False  False  False  False   True

In [835]: d = len(dfmi.columns.levels[0])
     ...: n = dfmi.shape[1]//d
     ...: for i in range(0,d*n,n):
     ...:     dfmi.values[:,i:i+n][a_mask] = 0

In [836]: dfmi
Out[836]: 
lvl0   A                       B                    
lvl1   a       b       c       a       b       c    
lvl2 bar foo bar foo bar foo bar foo bar foo bar foo
A0     0   0   0   2   5   4   0   0   0   8  11  10
A1    13   0  15  14   0  16  19   0  21  20   0  22
A2    25   0   0   0  29  28  31   0   0   0  35  34
A3    37  36  39  38  41   0  43  42  45  44  47   0
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Thanks for your answer. Does your solution rely on the fact that each level of the columns has the same number of labels (2, 2, 2)? – FLab Nov 27 '17 at 18:37
  • @FLab It does ! – Divakar Nov 27 '17 at 18:37
  • Uh...unfortunately this is not the case in my use case (my bad making the example not so generic, I am going to update it). I really liked this answer of yours: https://stackoverflow.com/questions/41493177/pandas-multiply-dataframes-with-multiindex-and-overlapping-index-levels I wonder if it could be modified/generalised for this situation? – FLab Nov 27 '17 at 18:40
  • @FLab Well as long as dfmi.A and dfmi.B have the same shapes, this should work, whatever be their shapes. That's the only requirement. – Divakar Nov 27 '17 at 18:46
  • Another clarification: are you assuming that the top columns level has only two entries? If I had A, B, C, would you do as below? a[:,:n][a_mask] = 0 ; a[:,n:2*n][a_mask] = 0; a[:,2*n:][a_mask] = 0 – FLab Nov 27 '17 at 19:04
  • @FLab Not anymore. Check out the edits. Should work regardless of the number of labels at the outer level - A,B, C, etc. – Divakar Nov 27 '17 at 19:07
  • Thanks a lot, I was now able to run it on my use case – FLab Nov 27 '17 at 19:11
0

Updated solution more roboust not hardcode for level values:

lvl0_values = dfmi.columns.get_level_values(0).unique()
pd.concat([dfmi[i].mask(a_mask.rename_axis(['lvl1','lvl2'],axis=1),0) for i in lvl0_values],
          keys=lvl0_values, axis=1)

Output:

lvl0   A               B            
lvl1   a       b       a       b    
lvl2 bar foo bar foo bar foo bar foo
A0     1   0   0   0   5   0   0   0
A1     9   0  11   0  13   0  15   0
A2    17  16  19   0  21  20  23   0
A3     0  24   0  26   0  28   0  30

One way you can do this:

pd.concat([dfmi['A'].mask(a_mask.rename_axis(['lvl1','lvl2'],axis=1),0),
           dfmi['B'].mask(a_mask.rename_axis(['lvl1','lvl2'],axis=1),0)],
           keys=['A','B'], axis=1)

print(a_mask)

lvl1      a             b       
lvl2    foo    bar    foo    bar
A0     True  False   True   True
A1     True  False   True  False
A2    False  False   True  False
A3    False   True  False   True

Output:

       A               B            
lvl1   a       b       a       b    
lvl2 bar foo bar foo bar foo bar foo
A0     1   0   0   0   5   0   0   0
A1     9   0  11   0  13   0  15   0
A2    17  16  19   0  21  20  23   0
A3     0  24   0  26   0  28   0  30
Scott Boston
  • 147,308
  • 15
  • 139
  • 187