0

I have a DataFrame with many descriptive columns and a couple of value columns, such as the following where val1 and val2 are values and everything else describes those values:

In [58]: countries = ['X', 'Y', 'Z']; sectors = ['A', 'B', 'C']
In [59]: ch = np.random.choice
In [61]: df = pd.DataFrame(dict(c=ch(countries, 100), s1=ch(sectors, 100), s2=ch(sectors, 100), is_good=ch([True, False], 100), val1=np.random.random(100), val2=np.random.random(100)))
In [62]: df.head()
Out[62]: 
   c is_good s1 s2      val1      val2
0  Z    True  B  B  0.694949  0.145197
1  X    True  A  A  0.319551  0.548003
2  X   False  A  C  0.946967  0.220035
3  X   False  B  A  0.998087  0.902530
4  Y   False  B  C  0.303517  0.660556

Say I'm not interested in whether a value is "good" or not, so I want to sum the value columns over is_good.

Normally I'd do:

In [63]: df.set_index(['c', 's1', 's2', 'is_good']).sum(level=['c', 's1', 's2'])
Out[63]: 
             val1      val2
c s1 s2                    
X A  A   2.170132  1.999072
     B   2.038995  3.528096
     C   4.041300  3.150400
  B  A   2.418448  2.011886
     B   1.212810  1.489008
     C   0.284823  0.961956

but this gets a bit clumsy and error-prone when there are lots of descriptive columns. (It's kind of fine here, but if there were a couple more, it'd already be arduous.)

I'm really looking for an operation which specifies the is_good column, rather than specifying everything but that column.

Community
  • 1
  • 1
LondonRob
  • 73,083
  • 37
  • 144
  • 201

1 Answers1

1

You could set it up like this:

val_cols = ['val1', 'val2']
descriptive_cols = df.columns.difference(val_cols)
group_cols = descriptive_cols.difference(['is_good'])
df.groupby(group_cols)[val_cols].sum()

Once you make the initial definitions then you can replace 'is_good' with any other column or subset of columns that you're not interested in so you would only have to do it once.

JoeCondron
  • 8,546
  • 3
  • 27
  • 28
  • +1 for the use of `df.columns.difference`. Never seen that before! I always do `[c for c in df.columns if c not in (some_list)]` but this is slicker! You don't even seem to need the `set()` bits. – LondonRob Jul 28 '15 at 10:21
  • Yeah, you can also use set operations on indices like `&, -, |, ^` but there are deprecation warnings. It recommends to use the named methods instead, such as `.intersection, .diff, .union, .sym_diff` etc. I've edited the answer to remove the unnecessary `set` calls – JoeCondron Jul 28 '15 at 10:36
  • To clarify, all of these operation can be done on both row and column indices – JoeCondron Jul 28 '15 at 10:43