17

In Numpy 1.4.1, what is the simplest or most efficient way of calculating the histogram of a masked array? numpy.histogram and pyplot.hist do count the masked elements, by default!

The only simple solution I can think of right now involves creating a new array with the non-masked value:

histogram(m_arr[~m_arr.mask])

This is not very efficient, though, as this unnecessarily creates a new array. I'd be happy to read about better ideas!

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
Eric O. Lebigot
  • 91,433
  • 48
  • 218
  • 260
  • 2
    For what it's worth, this would probably be considered a bug in `numpy.histogram`. You should probably file a bug report and mention it on the mailing list. It's easily fixed by replacing `asarray` with `asanyarray` in the `numpy.histogram` sources. – Joe Kington Aug 31 '10 at 14:55
  • Joe, you might want to submit your comment as an answer: I might well mark it as the accepted answer, if confirmed by the Numpy developers. – Eric O. Lebigot Sep 02 '10 at 07:41
  • 3
    I sent out a quick question to the list. http://mail.scipy.org/pipermail/numpy-discussion/2010-September/052575.html We'll see whether or not folks consider it a bug or not. It seems counter intuitive to me at the very least, though. – Joe Kington Sep 02 '10 at 19:56
  • 2
    For what it's worth, the general consensus was that it was intended behavior, and that such a fix would probably cause more problems than it would fix. E.g.: http://mail.scipy.org/pipermail/numpy-discussion/2010-September/052578.html – Joe Kington Sep 02 '10 at 23:36
  • Thank you, Joe. Can you summarize your comments in an answer. I'd like to mark it as the accepted answer because it shows that there is nothing better than tillsten's good solution. – Eric O. Lebigot Sep 03 '10 at 08:09
  • Quite a few numpy/scipy functions misbehave on masked arrays or NaNs; how about a SO community wiki to collect code snippets in one place ? – denis Sep 04 '10 at 16:46
  • @EOL - Done! Sorry, I didn't notice your message until now... I've been out of town for a few days. Thanks, by the way! – Joe Kington Sep 07 '10 at 01:33
  • @Denis - Seems like a reasonable idea to me... Feel free to start it if you like! Keep in mind that "misbehave" can be a bit subjective when it comes to handling masked arrays and NaN's. Either way, a lot of the default behavior can be counter-intuitive, i.m.o. – Joe Kington Sep 07 '10 at 01:36

4 Answers4

16

(Undeleting this as per discussion above...)

I'm not sure whether or not the numpy developers would consider this a bug or expected behavior. I asked on the mailing list, so I guess we'll see what they say.

Either way, it's an easy fix. Patching numpy/lib/function_base.py to use numpy.asanyarray rather than numpy.asarray on the inputs to the function will allow it to properly use masked arrays (or any other subclass of an ndarray) without creating a copy.

Edit: It seems like it is expected behavior. As discussed here:

If you want to ignore masked data it's just on extra function call

histogram(m_arr.compressed())

I don't think the fact that this makes an extra copy will be relevant, because I guess full masked array handling inside histogram will be a lot more expensive.

Using asanyarray would also allow matrices in and other subtypes that might not be handled correctly by the histogram calculations.

For anything else besides dropping masked observations, it would be necessary to figure out what the masked array definition of a histogram is, as Bruce pointed out.

Joe Kington
  • 275,208
  • 71
  • 604
  • 463
  • Thank you. One of the arguments against handling masked arrays in histograms is that if histograms handled masked values, one would have to decide how masked data with a masked array of weights should be treated. I don't think that there is any obviously better solution to this problem: it looks like `histogram()`'s features do not mix too well with masked input+weight arrays. – Eric O. Lebigot Sep 07 '10 at 07:12
9

Try hist(m_arr.compressed()).

tillsten
  • 14,491
  • 5
  • 32
  • 41
  • 1
    This is a better idea than my `m_arr[~m_arr.mask]`. However, it does not solve the problem that a new array is unnecessarily corrected. – Eric O. Lebigot Sep 02 '10 at 07:40
6

This is a super old question, but these days I just use:

numpy.histogram(m_arr, bins=.., range=.., density=False, weights=m_arr_mask)

Where m_arr_mask is an array with the same shape as m_arr, consisting of 0 values for elements of m_arr to be excluded from the histogram and 1 values for elements that are to be included.

Erik Hvatum
  • 301
  • 3
  • 7
  • Also, this won't work if you try to pass in a string for `bins`. Great answer aside from that. – Mad Physicist Mar 18 '16 at 12:34
  • I can't seem to make it work. When I pass a mask array for weights, the result does not seem consistent with the result I get without mask. I tried passing a mask of random 0 and 1 values, and expected the count in each bin to be divided by approx 2. But it gets divided by more than 20. – PiRK Jun 16 '20 at 10:18
  • See https://filebin.net/jp4x16ekgiyuupgu/Untitled.html?t=2w07ir8v – PiRK Jun 16 '20 at 10:18
  • I works fine with a more trivial example (much smaller arrays). Looks like a numpy bug, maybe the weights cause numpy.histogram to use a uint8 array as output, which causes overflow. – PiRK Jun 16 '20 at 10:38
0

After running into casting issues by trying Erik's solution (see https://github.com/numpy/numpy/issues/16616), I decided to write a numba function to achieve this behavior.

Some of the code was inspired by https://numba.pydata.org/numba-examples/examples/density_estimation/histogram/results.html. I added the mask bit.

import numpy
import numba  

@numba.jit(nopython=True)
def compute_bin(x, bin_edges):
    # assuming uniform bins for now
    n = bin_edges.shape[0] - 1
    a_min = bin_edges[0]
    a_max = bin_edges[-1]

    # special case to mirror NumPy behavior for last bin
    if x == a_max:
        return n - 1  # a_max always in last bin

    bin = int(n * (x - a_min) / (a_max - a_min))

    if bin < 0 or bin >= n:
        return None
    else:
        return bin


@numba.jit(nopython=True)
def masked_histogram(img, bin_edges, mask):
    hist = numpy.zeros(len(bin_edges) - 1, dtype=numpy.intp)

    for i, value in enumerate(img.flat):
        if mask.flat[i]:
            bin = compute_bin(value, bin_edges)
            if bin is not None:
                hist[int(bin)] += 1
    return hist  # , bin_edges

The speedup is significant. On a (1000, 1000) image:

enter image description here

PiRK
  • 893
  • 3
  • 9
  • 24