I have a matrix that I created by reading a tab-delimited text file with NumPy. It looks something like this:

sample  category_a  category_b  value
------  ----------  ----------  -----
1       A           Z           3.92
2       A           Y           12.43
3       B           Z           5.87
4       B           Y           6.71
etc...

I would like to filter or group the data in order to compute some basic statistics, such as the average value for each level of a single category, or for each combination of categories. Unfortunately, I am new to NumPy and do not see any obvious reference to this type of functionality in the documentation. Is it possible to group matrix data by category and perform calculations on the groups? Or do I need to filter the data while reading it from the file and then perform the calculations?

woemler
  • You can filter the data like this: http://stackoverflow.com/questions/3030480/numpy-array-how-to-select-indices-satisfying-multiple-conditions There is also a reference for the built-in functionality here; maybe one of the functions does what you need: http://docs.scipy.org/doc/numpy/reference/routines.sort.html – Aleksander Lidtke Dec 18 '13 at 02:27
  • I would recommend you take a look at [`pandas`](http://pandas.pydata.org/). – BrenBarn Dec 18 '13 at 02:58
  • @BrenBarn: That looks like it might be a better fit for what I need to do than base NumPy. Thanks! – woemler Dec 18 '13 at 11:32

1 Answer

In addition to the comments, here is how you could do it quite simply in pandas.

First, I import your example data (how you do this will of course depend on what your data look like):

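import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

s = """sample  category_a  category_b  value
1       A           Z           3.92
2       A           Y           12.43
3       B           Z           5.87
4       B           Y           6.71"""

# Split on runs of whitespace and use the first column ("sample") as the index
df = pd.read_csv(StringIO(s), sep=r"\s+", index_col=0)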

This gives the following DataFrame:

In [7]: df
Out[7]:
       category_a category_b  value
sample
1               A          Z   3.92
2               A          Y  12.43
3               B          Z   5.87
4               B          Y   6.71
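
Since the question mentions a tab-delimited file, the same DataFrame can probably be read straight from disk. This is only a sketch: the file name `data.tsv` and the temporary directory are made up here so that the example is self-contained, and your real file may need other `read_csv` options.

```python
import os
import tempfile

import pandas as pd

# Hypothetical tab-delimited file, written here only so the sketch can run on its own.
rows = [
    "sample\tcategory_a\tcategory_b\tvalue",
    "1\tA\tZ\t3.92",
    "2\tA\tY\t12.43",
    "3\tB\tZ\t5.87",
    "4\tB\tY\t6.71",
]
path = os.path.join(tempfile.mkdtemp(), "data.tsv")
with open(path, "w") as fh:
    fh.write("\n".join(rows) + "\n")

# sep="\t" splits on tabs; index_col=0 uses the "sample" column as the index.
df_file = pd.read_csv(path, sep="\t", index_col=0)
print(df_file)
```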

Now, to group the data by a category and take the mean of each group, you could do:

In [5]: df.groupby('category_a').mean(numeric_only=True)
Out[5]:
            value
category_a
A           8.175
B           6.290
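
If you want several basic statistics at once rather than just the mean, one option is to select the numeric column and pass a list of function names to `agg`. A minimal sketch on the same example data (the variable name `stats` is just for illustration):

```python
from io import StringIO

import pandas as pd

s = """sample  category_a  category_b  value
1       A           Z           3.92
2       A           Y           12.43
3       B           Z           5.87
4       B           Y           6.71"""

df = pd.read_csv(StringIO(s), sep=r"\s+", index_col=0)

# Select the numeric column first, then compute several statistics per group.
stats = df.groupby("category_a")["value"].agg(["mean", "min", "max", "count"])
print(stats)
```

The result has one row per category and one column per statistic, so a single value can be looked up with something like `stats.loc["A", "mean"]`.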

To group by multiple categories, pass a list of column names (in this dummy example, taking the mean does not do much, of course, since there is only one value in each group):

In [6]: df.groupby(['category_a', 'category_b']).mean()
Out[6]:
                       value
category_a category_b
A          Y           12.43
           Z            3.92
B          Y            6.71
           Z            5.87
joris