3

This question is related to this other one (and maybe also this) in which a binned mean is calculated. I tried to do the same, using scipy-v1.5.2 and v1.7.3 but ignoring nan values (if not, some binned_mean values will return as nan values; if I replace nan, e.g., by 0s, the x-axis will be correct but the mean won't).

The reason behind this is dealing with data (containing nan values) like the represented in this scatter plot:

data set of interest

My problem is that whenever I compute a binned mean (also tweaking bins) the results are mainly nan values because my data contains intervals of nans that I don't really want to discard. Here's some code that reproduce the problem:

from scipy.stats import binned_statistic
import numpy as np

#defining a generic data similar to mine

a=np.array([0.1, 0.15, 0.17, 0.2, 0.3, 0.4, np.nan, 0.12,  0.15, 0.17, 0.22, np.nan, 0.37, np.nan, 0.12,  0.15, 0.17, 0.17, 0.35, 0.42, np.nan])
b=np.linspace(1,len(a),len(a))
plt.scatter(b,a)
plt.hlines(np.nanmean(a),b[0],b[-1], linestyles='--')
plt.show() #you can uncomment this line to separate the plots.

#filtering nans

nana= a[~np.isnan(a)]
nanb= np.linspace(1,len(nana),len(nana))
plt.scatter(nanb,nana, marker='o')
plt.hlines(np.nanmean(a),b[0],b[-1], linestyles='--')

#calculating a binned mean
bmean_nana = binned_statistic(nanb, nana, 
                             statistic='mean', 
                             bins=3, 
                             range=(0, len(nana)))

bin_centers = bmean_nana.bin_edges[1:] - (abs(bmean_nana.bin_edges[0]-bmean_nana.bin_edges[1]))/2
plt.scatter(bin_centers, bmean_nana.statistic, marker='x', s=90)

with output:

results

when binned means are plotted together with a data, you could see that the calculation is not matching (which make sense). Question 1: Is there a simple-general solution to perform a binned mean ignoring nan values?

A few notes:

  • here there's a binning algorithm that has a nanHandling : None option

‘ignore’: In this case, NaNs contained in the input data are removed from the data prior binning. Note however, that x0, unless specified explicitly, will still refer to the first data point, whether or not this holds a NaN value.

  • Also there's an issue reported in GitHub: "scipy.stats.binned_statistic regressed in v1.4.0 #11365" in which barentsen reports that

PR #10664 recently changed the behavior of scipy.stats.binned_statistic to raise a ValueError whenever the data contains a non-finite number (e.g., nan, inf). This new behavior is counter-intuitive because many other statistical methods do not raise exceptions in the presence of NaNs, e.g. np.mean([np.nan]) returns nan rather than raising an exception.

and

(...) A user could choose to ignore the NaNs by providing a custom statistic function, e.g. np.nanmean:

>>> import scipy.stats, numpy as np
>>> x = [0.5, 0.5, 1.5, 1.5]
>>> values = [10, 20, np.nan, 40]
>>> scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic

array([15., 40.])

As of SciPy v1.4.0, the examples above raise this ValueError

>>> scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gb/bin/anaconda/envs/sp14/lib/python3.7/site-packages/scipy/stats/_binned_statistic.py", line 180, in binned_statistic
    [x], values, statistic, bins, range)
  File "/home/gb/bin/anaconda/envs/sp14/lib/python3.7/site-packages/scipy/stats/_binned_statistic.py", line 519, in binned_statistic_dd
    raise ValueError('%r or %r contains non-finite values.' % (sample, values,))
ValueError: [[0.5, 0.5, 1.5, 1.5]] or [10, 20, nan, 40] contains non-finite values.

Question 2: is there a simple way to use statistic=np.nanmean ?

nuwe
  • 177
  • 9
  • 1
    Do you know what version of scipy you're using? Since the github issue is closed, perhaps that means that newer versions of `binned_statistic` have fixed the bug. – Kevin Dec 06 '21 at 19:51
  • Sure @Kevin. I've tried with `1.5.2` and `1.7.3` (upgraded just today using `pip`). None of the versions solved the issue :-/ – nuwe Dec 06 '21 at 19:57
  • maybe `values = [0 if x is np.nan else x for x in values]`? – Christian Dec 06 '21 at 20:01
  • @Christian this will replace every `nan` by a 0. When calculating the mean the result will be affected. If, instead, there's a way to handle `nan`s, the mean should not change. – nuwe Dec 06 '21 at 20:14
  • 1
    Maybe it hasn't made it into a public release yet... I will check my own copy. – Kevin Dec 06 '21 at 20:19
  • 1
    @nuwe I'm the one who introduced the regression and wrote the fix you linked to in this question. That was fixed in scipy version 1.5.0, I'm running 1.5.4 locally and that does correctly take `statistic=np.nanmean` as an argument – Lucas Roberts Dec 06 '21 at 22:41

1 Answers1

1

The answer to both questions 1 and 2 is to use np.nanmean to ignore the nans in the data. The regression you link to was a bug that I inadvertently introduced and then fixed after it was raised. I'm not sure why you have SciPy 1.5.2 in your environment, it looks like 1.5.4 is the latest 1.5.X version so you probably want to update the environment you're using. However, that backport was applied to version 1.5.0 release versions so there should not be an issue on those versions if you have the latest.

Also, I set this up with scipy version 1.7.3 and that also works as intended for me. Here are snippets.

Version 1.5.4

import scipy
scipy.__version__
'1.5.4'
import numpy as np 
x = [0.5, 0.5, 1.5, 1.5]
values = [10, 20, np.nan, 40]
scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])

Version 1.7.3

import scipy
scipy.__version__
'1.7.3'
import scipy.stats, numpy as np
x = [0.5, 0.5, 1.5, 1.5]
values = [10, 20, np.nan, 40]
scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])

Version 1.5.2

import scipy
scipy.__version__
'1.5.2'
import scipy.stats, numpy as np
x = [0.5, 0.5, 1.5, 1.5]
values = [10, 20, np.nan, 40]
scipy.stats.binned_statistic(x, values, statistic='mean', bins=(0, 1, 2)).statistic
array([15., nan])
scipy.stats.binned_statistic(x, values, statistic=np.nanmean, bins=(0, 1, 2)).statistic
array([15., 40.])

Please try using the .__version__ to confirm your version of SciPy. If it is a 1.4.1 version-it does not have the change. I suspect that your version is 1.4.1 and not the other (higher) versions. Please use the portions of the code examples above to confirm the scipy version that is used in your environment.

Lucas Roberts
  • 1,252
  • 14
  • 17
  • Oops... a restart after upgrading was enough. I apologize for the false alarm. I was using (outputs of `.__version__`) `1.5.2` and upgraded to `1.7.3`. Now the code examples above are working perfectly fine. Much thanks!! – nuwe Dec 07 '21 at 03:23
  • 1
    @nuwe no worries, we've all had that happen before :) glad you got it fixed! – Lucas Roberts Dec 07 '21 at 22:24