
I have a function that takes four scalar inputs and returns a single float output, for example:

import numpy as np
from scipy.stats import multivariate_normal

grid_step = 0.25 #in units of sigma
grid_x, grid_y = np.mgrid[-2:2+grid_step:grid_step, -2:2+grid_step:grid_step]
pos = np.dstack((grid_x, grid_y))
rv = multivariate_normal([0.0, 0.0], [[1.0, 0], [0, 1.0]])
grid_pdf = rv.pdf(pos)*grid_step**2
norm_pdf = np.sum(rv.pdf(pos))*grid_step**2

def cal_prob(x, x_err, y, y_err):
    x_grid = grid_x*x_err + x
    y_grid = grid_y*y_err + y
    PSB_grid = ((x_grid>3) & (y_grid<10) & (y_grid < 10**(0.23*x_grid-0.46)))
    PSB_prob = np.sum(PSB_grid*grid_pdf)/norm_pdf
    return PSB_prob

What this function does is estimate the probability that some x-y measurement lies within a defined region of x-y space, given the uncertainties on x and y. It assumes the uncertainties are Gaussian and uncorrelated. Using the pre-made grid_pdf, it checks which grid points (scaled by x_err/y_err and shifted by x/y) fall inside the region, multiplies the resulting True/False grid by grid_pdf, and normalizes by norm_pdf. The probability is the sum of the resulting array.
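For concreteness, here is a quick sanity check (the input values are made up purely for illustration; the setup is repeated so the snippet runs on its own). A point deep inside the region should give a probability of 1, and a point far outside should give 0:

```python
import numpy as np
from scipy.stats import multivariate_normal

grid_step = 0.25  # in units of sigma
grid_x, grid_y = np.mgrid[-2:2+grid_step:grid_step, -2:2+grid_step:grid_step]
pos = np.dstack((grid_x, grid_y))
rv = multivariate_normal([0.0, 0.0], [[1.0, 0], [0, 1.0]])
grid_pdf = rv.pdf(pos) * grid_step**2
norm_pdf = np.sum(rv.pdf(pos)) * grid_step**2

def cal_prob(x, x_err, y, y_err):
    x_grid = grid_x * x_err + x
    y_grid = grid_y * y_err + y
    PSB_grid = (x_grid > 3) & (y_grid < 10) & (y_grid < 10**(0.23*x_grid - 0.46))
    return np.sum(PSB_grid * grid_pdf) / norm_pdf

# (5, 1) with small errors: the whole shifted grid satisfies all three cuts
print(cal_prob(5.0, 0.1, 1.0, 0.1))   # ~1.0
# (0, 20) with small errors: no grid point satisfies x > 3
print(cal_prob(0.0, 0.1, 20.0, 0.1))  # 0.0
```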

I want this function applied element-wise, with the four inputs stored in four separate numpy masked arrays of the same shape (with possibly different masks), and the outputs collected into a new array of the same shape. Is there a way that doesn't use a for loop?

Thanks!

My current solution is this:

mask1 = np.array([[False, True, False],[True, True, True],[True, False, False]])
mask2 = np.array([[True, True, True],[True, True, False],[False, False, True]])
# the only overlaps should be [0,1], [1,0] and [1,1]

x = np.ma.array(np.random.randn(*mask1.shape), mask=~mask1)
x_err = np.ma.array(np.abs(np.random.randn(*mask1.shape))*0.1, mask=~mask1)

y = np.ma.array(np.random.randn(*mask2.shape), mask=~mask2)
y_err = np.ma.array(np.abs(np.random.randn(*mask2.shape))*0.1, mask=~mask2)

# a combined array whose mask is the union of all four masks
all_mask = x + x_err + y + y_err

prob = np.zeros(mask1.shape)
prob = np.ma.masked_where(np.ma.getmask(all_mask), prob)

for i,xi in np.ma.ndenumerate(all_mask):
    prob[i] = cal_prob(xi, x_err[i], y[i], y_err[i])
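A loop-free variant of the same approach (a sketch, not a drop-in replacement: `np.vectorize` still calls `cal_prob` once per element under the hood, so it is a convenience, not a speed-up) is to fill masked slots with arbitrary placeholder values so `cal_prob` never sees masked data, then re-apply the union of the masks to the output. The placeholders (0 for values, 1 for errors) are harmless because those outputs get masked again:

```python
import numpy as np
from scipy.stats import multivariate_normal

grid_step = 0.25
grid_x, grid_y = np.mgrid[-2:2+grid_step:grid_step, -2:2+grid_step:grid_step]
pos = np.dstack((grid_x, grid_y))
rv = multivariate_normal([0.0, 0.0], [[1.0, 0], [0, 1.0]])
grid_pdf = rv.pdf(pos) * grid_step**2
norm_pdf = np.sum(rv.pdf(pos)) * grid_step**2

def cal_prob(x, x_err, y, y_err):
    x_grid = grid_x * x_err + x
    y_grid = grid_y * y_err + y
    PSB_grid = (x_grid > 3) & (y_grid < 10) & (y_grid < 10**(0.23*x_grid - 0.46))
    return np.sum(PSB_grid * grid_pdf) / norm_pdf

rng = np.random.default_rng(0)
mask1 = np.array([[False, True, False], [True, True, True], [True, False, False]])
mask2 = np.array([[True, True, True], [True, True, False], [False, False, True]])

x = np.ma.array(rng.standard_normal(mask1.shape), mask=~mask1)
x_err = np.ma.array(np.abs(rng.standard_normal(mask1.shape)) * 0.1, mask=~mask1)
y = np.ma.array(rng.standard_normal(mask2.shape), mask=~mask2)
y_err = np.ma.array(np.abs(rng.standard_normal(mask2.shape)) * 0.1, mask=~mask2)

# masked wherever any of the four inputs is masked
all_mask = np.ma.getmaskarray(x) | np.ma.getmaskarray(x_err) \
         | np.ma.getmaskarray(y) | np.ma.getmaskarray(y_err)

# run cal_prob on plain filled arrays, then restore the combined mask
v_cal_prob = np.vectorize(cal_prob)
prob = v_cal_prob(x.filled(0.0), x_err.filled(1.0),
                  y.filled(0.0), y_err.filled(1.0))
prob = np.ma.masked_where(all_mask, prob)
```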

  • As long as `cal_prob` only accepts scalar values, you have to call it once for each set of inputs. Hence a loop, or `np.vectorize`, which has a performance disclaimer. For smallish cases it is slower than the explicit loop, though it scales somewhat better. I don't know how masked arrays behave around `np.vectorize` – hpaulj Nov 24 '22 at 01:45

1 Answer


A test of np.vectorize with a masked array input:

In [180]: def foo(x):
     ...:     print(x)
     ...:     return 2*x
     ...:     

In [181]: np.vectorize(foo)(np.ma.masked_array([1,2,3],[True,False,True]))
1
1
2
3
Out[181]: 
masked_array(data=[--, 4, --],
             mask=[ True, False,  True],
       fill_value=999999)

In [182]: _.data
Out[182]: array([2, 4, 6])
hpaulj
  • Thanks. I don't think the function I have right now is particularly vectorizable. I still slapped np.vectorize and np.frompyfunc onto it and they did not provide any speed-up. I have added the full function to the post, along with what I am trying to do. – Ho-Hin Leung Nov 25 '22 at 15:43
  • `np.vectorize` has a clear performance disclaimer. But sometimes it's useful for the calling convenience, especially if you want to 'broadcast' the arguments against each other. 'True vectorization' requires major surgery to the function itself. Using masked arrays further complicates that, since most `numpy` functions don't do anything special with the mask. – hpaulj Nov 25 '22 at 16:20
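For reference, the "major surgery" could look like the following sketch (`cal_prob_vec` is a hypothetical name; it assumes the same fixed 17x17 grid as the question). The idea is to add two trailing axes to each input so every element's shifted and scaled grid is built in one broadcast, and the per-element sum becomes a reduction over those grid axes:

```python
import numpy as np
from scipy.stats import multivariate_normal

grid_step = 0.25
grid_x, grid_y = np.mgrid[-2:2+grid_step:grid_step, -2:2+grid_step:grid_step]
pos = np.dstack((grid_x, grid_y))
rv = multivariate_normal([0.0, 0.0], [[1.0, 0], [0, 1.0]])
grid_pdf = rv.pdf(pos) * grid_step**2
norm_pdf = np.sum(rv.pdf(pos)) * grid_step**2

def cal_prob_vec(x, x_err, y, y_err):
    """Vectorized cal_prob: inputs of any matching shape, output of that shape."""
    # add two trailing axes so each element broadcasts against the (17, 17) grid
    x = np.asarray(x)[..., None, None]
    x_err = np.asarray(x_err)[..., None, None]
    y = np.asarray(y)[..., None, None]
    y_err = np.asarray(y_err)[..., None, None]
    x_grid = grid_x * x_err + x          # shape (..., 17, 17)
    y_grid = grid_y * y_err + y
    inside = (x_grid > 3) & (y_grid < 10) & (y_grid < 10**(0.23*x_grid - 0.46))
    # reduce over the two grid axes only, keeping the element axes
    return np.sum(inside * grid_pdf, axis=(-2, -1)) / norm_pdf
```

For masked inputs, this can be run on the `.filled(...)` data with the union of the masks re-applied afterwards, as with the `np.vectorize` route. Note the intermediate arrays are 289 times the size of the inputs, so very large arrays may need to be processed in chunks.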