
I am looking for an efficient way to do the following calculation on millions of arrays. For each array, I want the mean of the values that fall into the most frequent histogram bin, as demonstrated below. Some of the arrays may contain NaN values; the rest of the values are floats. The loop over my actual data takes too long to finish.

import numpy as np

array = np.random.uniform(0, 10, 800)

# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan

array = array.reshape(50, 16)

bin_values = np.linspace(0, 10, 21)

# frequency of each bin, computed row by row
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)

bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + abs(bin_values[1] - bin_values[0])

values = np.zeros(array.shape[0])

for i in range(array.shape[0]):
    values[i] = np.nanmean(array[i][(array[i] >= bin_start[i]) & (array[i] < bin_end[i])])

Also, when I run the above code I get three warnings. The first is 'RuntimeWarning: Mean of empty slice' on the line where I calculate values. I added a condition to skip this line when a row is all NaN, but the warning did not go away, and I am wondering why. The other two warnings come from the less and greater_equal comparisons, which makes sense to me since they can involve NaN values.
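A note on the first warning: 'Mean of empty slice' is emitted inside np.nanmean itself whenever the slice it receives has no non-NaN values, so a condition on the result does not help unless it prevents the call from running at all. A minimal sketch of both ways to avoid it (the variable names here are just for illustration):

```python
import warnings
import numpy as np

row = np.array([np.nan, np.nan])  # a row with no valid values

# option 1: skip the nanmean call entirely for all-NaN rows
value = np.nan if np.all(np.isnan(row)) else np.nanmean(row)

# option 2: silence just this RuntimeWarning around the call
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    value2 = np.nanmean(row)  # still returns nan, but without the warning
```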

Omid

1 Answer


The arrays I want to run this algorithm on are independent, and I am already processing them with 12 separate scripts. Running the code in parallel would be an option; for now, though, I am looking to improve the algorithm itself.

The reason I am using a lambda function is to run numpy.histogram over an axis, since the histogram function does not seem to take an axis argument. I was able to use a mask and remove the loop from the code. The code is about twice as fast now, but I think it can still be improved.

I can explain what I want to do in more detail with an example. Imagine I have 36 numbers that are greater than 0 and smaller than 20, and bins of equal width 0.5 over the same interval (0.0_0.5, 0.5_1.0, 1.0_1.5, … , 19.5_20.0). If I place the 36 numbers into their corresponding bins, I want the mean of the numbers in the bin that holds the most of them.
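With a handful of made-up numbers (fewer than 36 to keep it readable), the procedure described above looks like this:

```python
import numpy as np

# made-up sample values in (0, 20)
nums = np.array([0.2, 0.3, 0.4, 1.1, 7.6, 7.9, 19.2])
bins = np.arange(0.0, 20.5, 0.5)   # edges 0.0, 0.5, ..., 20.0

counts, edges = np.histogram(nums, bins=bins)
k = np.argmax(counts)                               # fullest bin: [0.0, 0.5) holds 3 values
in_bin = (nums >= edges[k]) & (nums < edges[k + 1])
mean_of_fullest = nums[in_bin].mean()               # mean of 0.2, 0.3, 0.4 -> 0.3
```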

Please post your solution if you can think of a faster algorithm.

import numpy as np

# creating an array to test the algorithm

array = np.random.uniform(0, 10, 800)

# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan

array = array.reshape(50, 16)

# the algorithm

bin_values = np.linspace(0, 10, 21)

# calculating the frequency of each bin
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)

bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + abs(bin_values[1] - bin_values[0])

# creating a mask to get the mean over the bin with maximum frequency
mask = (array >= bin_start) & (array < bin_end)

mask_nan = np.full(mask.shape, np.nan)
mask_nan[mask] = 1

v = np.nanmean(array * mask_nan, axis=1)
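One further direction that might be worth timing (I have not benchmarked it against the data sizes above, so treat it as a sketch): replace the per-row np.histogram calls with a single np.digitize plus one np.bincount over a flattened (row, bin) index, which removes apply_along_axis entirely. The function name mean_of_fullest_bin is mine, not from any library:

```python
import numpy as np

def mean_of_fullest_bin(arr2d, bin_values):
    """Per-row mean of the values falling in that row's most populated bin."""
    n_rows = arr2d.shape[0]
    n_bins = len(bin_values) - 1
    width = bin_values[1] - bin_values[0]

    # bin index per element; NaNs and out-of-range values land in a throwaway
    # slot n_bins so they never win the argmax (unlike np.histogram, a value
    # exactly equal to the last edge also goes to the throwaway slot)
    idx = np.digitize(arr2d, bin_values) - 1        # NaN -> n_bins
    idx = np.where((idx < 0) | (idx >= n_bins), n_bins, idx)

    # one bincount over a flattened (row, bin) index replaces per-row histograms
    flat = (np.arange(n_rows)[:, None] * (n_bins + 1) + idx).ravel()
    counts = np.bincount(flat, minlength=n_rows * (n_bins + 1))
    counts = counts.reshape(n_rows, n_bins + 1)[:, :n_bins]  # drop the NaN slot

    k = counts.argmax(axis=1)                       # fullest bin per row
    lo = bin_values[k][:, None]

    with np.errstate(invalid="ignore"):             # NaN comparisons are False
        in_bin = (arr2d >= lo) & (arr2d < lo + width)
        sums = np.where(in_bin, arr2d, 0.0).sum(axis=1)
        n = in_bin.sum(axis=1)
        return sums / n                             # all-NaN row -> nan

bin_values = np.linspace(0, 10, 21)
arr = np.array([[0.1, 0.2, 0.3, 5.5, np.nan, 9.9],
                [np.nan] * 6,
                [1.1, 1.2, 3.3, 3.4, 3.45, np.nan]])
result = mean_of_fullest_bin(arr, bin_values)
```

Rows whose fullest bin is empty (e.g. an all-NaN row) come out as nan instead of raising the empty-slice warning, since the division is guarded by np.errstate.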
Omid