
I want to calculate the standard deviation for values below and above the average of a matrix of n_par parameters and n_sample samples. The fastest way I found so far is:

stdleft = numpy.zeros_like(mean)
for jpar in xrange(p.shape[1]):  # loop over the n_par columns
    stdleft[jpar] = p[p[:, jpar] < mean[jpar], jpar].std()

where p is a matrix of shape (n_samples, n_par). Is there a smarter way to do it without the for loop? I have roughly n_par = 200 and n_samples = 1e8, so these three lines take ages to run.

Any idea would be really helpful!

Thank you

mcave

2 Answers


Pandas is your friend. Convert your matrix into a pandas DataFrame and index the DataFrame logically. Something like this:

import pandas

mat = pandas.DataFrame(p)

This creates a DataFrame from the original numpy matrix p. Then we compute the column means for the DataFrame.

m = mat.mean()

This creates an n_par-sized Series holding the column means of mat. Finally, index mat with the < comparison and apply std to the result.

stdleft = mat[mat < m].std()

Similarly for stdright (see the sketch below). This takes a couple of minutes to compute on my machine.
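
For the other half, a minimal sketch along the same lines (not in the original answer; values exactly equal to the mean fall in neither half here, matching the strict < used above):

stdright = mat[mat > m].std()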

Here's the doc page for pandas: http://pandas.pydata.org/

Edit: updated following the comment below. You can do almost the same indexing using the original p.

m = p.mean(axis=0)
logical = p < m

logical contains a boolean matrix of the same size as p. This is where pandas comes in handy: you can directly index a pandas DataFrame with a boolean frame of the same size. Doing so in numpy is slightly harder; I guess looping is the best way to achieve it?

for i in range(p.shape[1]):  # one pass per column (n_par), not per sample
    stdleft[i] = p[logical[:, i], i].std()
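
If you want to avoid the loop entirely in plain numpy, one option (a sketch, not part of the original answer; it needs numpy >= 1.8 for nanstd and builds a full-size temporary copy of p) is to turn the unwanted entries into NaN and take a NaN-ignoring standard deviation per column:

# Entries at or above the column mean become NaN; nanstd skips them.
stdleft = numpy.nanstd(numpy.where(logical, p, numpy.nan), axis=0)
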
Sudeep Juvekar
    There's no need for pandas for that. It's equivalent to doing `m = p.mean(axis=0); (p < m).std(axis=0)`. However, I don't think you're getting the result you think you are. You're currently calculating the standard deviation of a boolean array. I.e. you're not calculating the standard deviation of the _values_ where `mat < m`, just the standard deviation of the _number of values_ that satisfy that criteria in each column. – Joe Kington Feb 28 '14 at 18:02
  • `m = p.mean()` will result in a _single number_. `m = p.mean(axis=0)` will give you the _means for each column_. (similar to pandas) You can directly index a numpy array with a logical array of the same size. You don't need the loop. However, in this case, for the pure-numpy solution, you'd use masked arrays. – Joe Kington Feb 28 '14 at 18:21

As I understand it, you want to calculate the standard deviation of each column where the values are below the mean for that column.

In numpy, it's easiest to use masked arrays for this.

As an example:

import numpy as np

# 10 samples, 3 columns
p = np.random.random((10, 3))

# Calculate the mean of each column
colmeans = p.mean(axis=0)

# Make a boolean array where our condition is True
mask = p < colmeans

# Find the standard deviation of values in each column below the column's mean.
# For masked arrays, the True values will be masked, so we'll invert the array.
stdleft = np.ma.masked_where(~mask, p).std(axis=0)
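
The question also asks about the values above the mean. A minimal companion sketch (not part of the original answer) reuses the same mask with the opposite sense; note this keeps values at or above each column's mean, so use p <= colmeans as the mask if you want strictly above:

# Mask the values below the mean instead, keeping those at or above it.
stdright = np.ma.masked_where(mask, p).std(axis=0)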

You can also use pandas for this as @SudeepJuvekar mentioned. The performance should be broadly similar, but pandas should be a bit faster for this particular operation (untested).
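
Since that comparison is untested, here is a rough way one might time the two approaches. The array shape below is a small placeholder, not the asker's actual 1e8 x 200 data, so treat the numbers as indicative only:

import timeit

import numpy as np
import pandas

# Small stand-in data so the benchmark finishes quickly; scale up as needed.
p = np.random.random((100000, 20))
mat = pandas.DataFrame(p)
colmeans = p.mean(axis=0)
mask = p < colmeans

def masked_std():
    # numpy masked-array version from this answer
    return np.ma.masked_where(~mask, p).std(axis=0)

def pandas_std():
    # pandas boolean-indexing version from the other answer
    return mat[mat < mat.mean()].std()

print("masked arrays:", timeit.timeit(masked_std, number=10))
print("pandas       :", timeit.timeit(pandas_std, number=10))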

Joe Kington