2

I would like to partially "collapse" a DataFrame/matrix and keep the structure intact by just summing the condensed values. For example, I have this:

CHROM     POS     GENE     DESC     JOE      FRED   BILLY    SUSAN    TONY
10        1442    LOXL4    bad      1        0      0        1        0
10        335     LOXL4    bad      1        0      0        0        0
10        3438    LOXL4    good     0        0      1        0        0
10        4819    PYROXD2  bad      0        1      0        0        0
10        4829    PYROXD2  bad      0        1      0        1        0
10        9851    HPS1     good     1        0      0        0        0

The first 4 columns are descriptors, and the last 4 columns are people/observations. The end goal is to count how many total "good" and "bad" observations per GENE per person. Thus, I want this:

GENE     DESC     JOE      FRED   BILLY    SUSAN    TONY
LOXL4    bad      2        0      0        1        0
LOXL4    good     0        0      1        0        0
PYROXD2  bad      0        2      0        1        0
HPS1     good     1        0      0        0        0

The following code collapses all the individual observations (Joe, Fred, etc), how can I keep them separate? I would also like to be flexible enough to accommodate a more individuals in the future (keeping the same 4 descriptor columns)

mytable.groupby(['GENE','DESC']).size()
piRSquared
  • 285,575
  • 57
  • 475
  • 624
alexhli
  • 399
  • 2
  • 6
  • 20

1 Answers1

4

Just use the aggregate method of the groupby object:

In [156]: df
Out[156]: 
   CHROM   POS     GENE  DESC  JOE  FRED  BILLY  SUSAN  TONY
0     10  1442    LOXL4   bad    1     0      0      1     0
1     10   335    LOXL4   bad    1     0      0      0     0
2     10  3438    LOXL4  good    0     0      1      0     0
3     10  4819  PYROXD2   bad    0     1      0      0     0
4     10  4829  PYROXD2   bad    0     1      0      1     0
5     10  9851     HPS1  good    1     0      0      0     0

In [157]: grouped = df.groupby(['GENE', 'DESC'])

In [158]: grouped.agg(np.sum) # agg is a shortcut for aggregate
Out[158]: 
              CHROM   POS  JOE  FRED  BILLY  SUSAN  TONY
GENE    DESC                                            
HPS1    good     10  9851    1     0      0      0     0
LOXL4   bad      20  1777    2     0      0      1     0
        good     10  3438    0     0      1      0     0
PYROXD2 bad      20  9648    0     2      0      1     0

As mentioned by Daniel Velkow in the comment, the groupby object has some "build in" methods for simple aggregations like sum, mean, ... (something like ufuncs in numpy which are available as methods for numpy arrays). So the last step could be further simplified to

In [159]: grouped.sum()
Out[159]: 
              CHROM   POS  JOE  FRED  BILLY  SUSAN  TONY
GENE    DESC                                            
HPS1    good     10  9851    1     0      0      0     0
LOXL4   bad      20  1777    2     0      0      1     0
        good     10  3438    0     0      1      0     0
PYROXD2 bad      20  9648    0     2      0      1     0

If you want different operations on each column, according to the docs you can pass a dict to aggregate.

However I found no way to specify a function for a single column and use a default for others. So one way would be to define a custom aggregation function:

def custom_agg(s, default=np.sum, other={}):
    if s.name in other.keys():
        return other[s.name](s)
    else:
        return default(s)

and than apply it by passing the function and the args to agg:

In [59]: grouped.agg(custom_agg, default=np.sum, other={'CHROM': np.mean})
Out[59]: 
              CHROM   POS  JOE  FRED  BILLY  SUSAN  TONY
GENE    DESC                                            
HPS1    good     10  9851    1     0      0      0     0
LOXL4   bad      10  1777    2     0      0      1     0
        good     10  3438    0     0      1      0     0
PYROXD2 bad      10  9648    0     2      0      1     0
bmu
  • 35,119
  • 13
  • 91
  • 108
  • 1
    Also `grouped.sum()` is a shorthand for `grouped.agg(np.sum)` – Daniel Aug 10 '12 at 19:22
  • 1
    As a followup question, can you perform different operations on each column when aggregating? For example if I wanted to sum the last 5 columns as given, but take the average of the "CHROM" column instead of summing those values? – alexhli Aug 13 '12 at 17:35
  • 1
    @alexhli I updated my answer. Usually passing a dict would be enough, but you must name all columns. If you want a default for the rest of the columns, I think you need a custom function. – bmu Aug 14 '12 at 05:43