I am having trouble using NumPy's `np.histogram` on a particular data set.

The call is very slow (several minutes) and uses a very large amount of memory. Memory usage peaks at about 12 GB, drops to ~750 MB, and then climbs back into the high GBs. This cycle seems to repeat endlessly. Even if I let it run to the end, it takes multiple minutes and finishes with a MemoryError.

All this happens with a very small data set, such as the 26-element array below:

array(['2.400000024000011e-05', '2.4000000240000108e-05',
       '2.400000024000011e-05', '2.400000024000012e-05',
       '2.4000000240000105e-05', '2.4000000240000105e-05',
       '2.400000024000009e-05', '2.400000024000012e-05',
       '2.400000024000012e-05', '2.400002024000031e-05',
       '2.4000000240000145e-05', '2.400000024000012e-05',
       '2.400000024000012e-05', '2.4000000240000064e-05',
       '2.400000024000012e-05', '2.400000024000012e-05',
       '2.400000024000012e-05', '2.400000024000012e-05',
       '2.400000024000012e-05', '2.400000024000012e-05',
       '2.400000024000001e-05', '2.400000024000012e-05',
       '2.4000020240000364e-05', '2.400000024000012e-05',
       '2.400000024000012e-05', '2.400000024000012e-05'], dtype='float64')

I assume part of the slowdown comes from hitting the physical memory limit and then being bound by swapping.

The histogram calculation is as follows:

histY, histX = np.histogram(vals, bins='auto')

where `vals` is the NumPy array shown above.

Note the small min-max margin in the above case: 2.0000000353764813e-11.

My quick guess: the histogram function is stuck in some iterative optimization to find the best bin size versus bin count for this data set, and is struggling with the small min-max margin.
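One way to check this guess without exhausting memory is to estimate by hand how many bins `bins='auto'` would ask for. The sketch below assumes, per the NumPy documentation, that `'auto'` takes the larger bin count of the Sturges and Freedman-Diaconis estimators, and that the count array uses 8-byte integers. The names `fd_width`, `sturges_width`, `auto_width` and `n_bins` are illustrative, not NumPy internals. If the assumption holds, no iteration is involved: the interquartile range here is many orders of magnitude smaller than the min-max margin, so a single bin-width formula yields a bin count on the order of 10^9.

```python
import numpy as np

# The 26 values from the array above, as plain floats.
vals = np.array([
    2.400000024000011e-05, 2.4000000240000108e-05,
    2.400000024000011e-05, 2.400000024000012e-05,
    2.4000000240000105e-05, 2.4000000240000105e-05,
    2.400000024000009e-05, 2.400000024000012e-05,
    2.400000024000012e-05, 2.400002024000031e-05,
    2.4000000240000145e-05, 2.400000024000012e-05,
    2.400000024000012e-05, 2.4000000240000064e-05,
    2.400000024000012e-05, 2.400000024000012e-05,
    2.400000024000012e-05, 2.400000024000012e-05,
    2.400000024000012e-05, 2.400000024000012e-05,
    2.400000024000001e-05, 2.400000024000012e-05,
    2.4000020240000364e-05, 2.400000024000012e-05,
    2.400000024000012e-05, 2.400000024000012e-05,
], dtype=np.float64)

n = vals.size
spread = vals.max() - vals.min()  # the min-max margin

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
q25, q75 = np.percentile(vals, [25, 75])
iqr = q75 - q25
fd_width = 2.0 * iqr / n ** (1.0 / 3.0)

# Sturges bin width: range / (log2(n) + 1)
sturges_width = spread / (np.log2(n) + 1.0)

# Assumption: 'auto' uses the narrower of the two widths (i.e. more bins),
# falling back to Sturges when the IQR is exactly zero.
auto_width = min(fd_width, sturges_width) if fd_width > 0 else sturges_width
n_bins = int(np.ceil(spread / auto_width))

print(f"spread = {spread:.3e}, IQR = {iqr:.3e}")
print(f"implied bins = {n_bins:,}")
# Assumption: one 8-byte integer per bin for the count array.
print(f"count array alone would need ~{n_bins * 8 / 1e9:.1f} GB")
```

Nothing large is allocated here, so the estimate runs instantly even though the equivalent `np.histogram` call does not.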

The error I receive when it finally ends:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".....\lib\site-packages\numpy\lib\histograms.py", line 737, in histogram
    n = np.zeros(n_equal_bins, ntype)
MemoryError

Could someone please explain what is really happening here and what can be done to circumvent the issue?

The-Duck
  • I can confirm this bug exists on 1.18.1. – John Zwinck Mar 12 '20 at 11:59
  • Have you tried with a fixed bins number instead of 'auto' to contrast timing? – alan.elkin Mar 12 '20 at 12:04
  • @alan.elkin: With fixed number of bins like 3, it completes instantly. As you'd expect it to even with `auto`. – John Zwinck Mar 12 '20 at 12:05
  • @JohnZwinck Thanks for confirming. I have found the following bug report on the NumPy GitHub: https://github.com/numpy/numpy/issues/10297 Looks like a two-year-old bug. Looks like I'll have to use the fixed bins approach. – The-Duck Mar 12 '20 at 13:30
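
The fixed-bins workaround from the comments can be sketched as follows. The bin count of 3 is the value mentioned above. The `'sturges'` alternative is an addition not discussed in the thread: it is a standard `bins` option whose bin count depends only on the sample size, so it should not be affected by the tiny interquartile range.

```python
import numpy as np

# The 26 values from the question, as plain floats.
vals = np.array([
    2.400000024000011e-05, 2.4000000240000108e-05,
    2.400000024000011e-05, 2.400000024000012e-05,
    2.4000000240000105e-05, 2.4000000240000105e-05,
    2.400000024000009e-05, 2.400000024000012e-05,
    2.400000024000012e-05, 2.400002024000031e-05,
    2.4000000240000145e-05, 2.400000024000012e-05,
    2.400000024000012e-05, 2.4000000240000064e-05,
    2.400000024000012e-05, 2.400000024000012e-05,
    2.400000024000012e-05, 2.400000024000012e-05,
    2.400000024000012e-05, 2.400000024000012e-05,
    2.400000024000001e-05, 2.400000024000012e-05,
    2.4000020240000364e-05, 2.400000024000012e-05,
    2.400000024000012e-05, 2.400000024000012e-05,
], dtype=np.float64)

# Option 1: an explicit bin count, as suggested in the comments.
hist_y, hist_x = np.histogram(vals, bins=3)

# Option 2 (not from the thread): a rule based on sample size only.
hist_y_sturges, hist_x_sturges = np.histogram(vals, bins='sturges')

print(hist_y, hist_x)
print(hist_y_sturges, hist_x_sturges)
```

Either option sidesteps the automatic bin-width selection, at the cost of putting almost all values in a single bin because the two larger values dominate the range.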
