I need to efficiently process very large 1D arrays, extracting some statistics per bin, and I have found the function binned_statistic from scipy.stats very useful, as its 'statistic' argument works quite efficiently.
I would like to perform a 'count', but ignoring zero values.
In parallel, I am working with sliding windows (the pandas rolling function) over the same arrays, and there it works nicely to substitute zeros with NaN, but binned_statistic does not share this behavior.
This is a toy example of what I am doing:
import numpy as np
import pandas as pd
from scipy.stats import binned_statistic
# As example with sliding windows, this returns just the length of each window:
a = np.array([1., 0., 0., 1.])
pd.Series(a).rolling(2).count() # Returns [1.,2.,2.,2.]
# You can make the count to do it only if not zero:
nonzero_a = a.copy()
nonzero_a[nonzero_a == 0.0] = np.nan
pd.Series(nonzero_a).rolling(2).count() # Returns [1.,1.,0.,1.]
# However, with binned_statistic I am not able to do anything similar:
binned_statistic(range(4), a, bins=2, statistic='count')[0]
binned_statistic(range(4), nonzero_a, bins=2, statistic='count')[0]
binned_statistic(range(4), np.array([1., False, None, 1.]), bins=2, statistic='count')[0]
All the previous runs give the same output, [2., 2.], but I am expecting [1., 1.].
The only option I have found is to pass a custom function, but it performs considerably worse than the built-in statistics on real cases.
binned_statistic(range(4), a, bins=2, statistic=np.count_nonzero)
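In case it helps frame the question, one possible workaround (a sketch, not necessarily the fastest for your real data) is to drop the zero entries from both x and the values before calling binned_statistic, fixing the bin edges with the range argument so they match the full array, and keep the fast built-in 'count'. An np.histogram call with 0/1 weights should give the same result:

```python
import numpy as np
from scipy.stats import binned_statistic

x = np.arange(4)
a = np.array([1., 0., 0., 1.])

# Drop zeros first, then use the fast built-in 'count'.
# range=(x.min(), x.max()) keeps the bin edges identical to the unfiltered call.
mask = a != 0
counts, edges, _ = binned_statistic(
    x[mask], a[mask], bins=2, range=(x.min(), x.max()), statistic='count'
)
print(counts)  # [1. 1.]

# Equivalent via np.histogram: weight each point by 1 if nonzero, 0 otherwise.
counts2, _ = np.histogram(x, bins=2, weights=(a != 0).astype(float))
print(counts2)  # [1. 1.]
```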