(1) A more usual numpy idiom would be:
cube.data[cube.data < threshold_value] = 0.0
I think that should help with the memory problem, as it modifies the values in place rather than computing an entire new floating-point array to assign back.
However, it does need to create a data-sized boolean array for cube.data < threshold_value, so it might still not solve your problem.
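As a rough, self-contained illustration of the difference (using a plain numpy array and a made-up threshold in place of your cube):

import numpy as np

threshold_value = 0.5                        # made-up threshold for illustration
data = np.random.random((4, 10, 200, 300))   # stands in for cube.data

# In-place masked assignment: the only temporary is the boolean mask
# (1 byte per element), not a second full float array (8 bytes per element).
data[data < threshold_value] = 0.0

# Whereas an expression like this builds a complete new float array first:
# data = np.where(data < threshold_value, 0.0, data)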
(2) A really simple performance improvement could be to do this in sections, if you have a dimension you can slice over, such as a typical Z dimension with a few tens of levels?
Then you can just divide up the task, e.g. for a 4-d cube with dims t,z,y,x:
nz = cube.shape[1]                   # number of levels along the z dimension
for iz in range(nz):
    part = cube.data[:, iz]
    part[part < threshold_value] = 0.0
That should also work well if your cube already contains "real" rather than "lazy" data.
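If the z dimension isn't necessarily the second one, a slightly more general sketch (assuming a hypothetical vertical coordinate named 'model_level_number'; adjust to whatever your cube actually has) can locate it by coordinate name:

# Find which data dimension the vertical coordinate maps to.
(z_dim,) = cube.coord_dims('model_level_number')   # hypothetical coord name
for iz in range(cube.shape[z_dim]):
    # Index a single level, keeping full slices on every other dimension.
    index = [slice(None)] * cube.ndim
    index[z_dim] = iz
    part = cube.data[tuple(index)]                  # a view, so edits stick
    part[part < threshold_value] = 0.0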
(3) However, I wonder if your key problem could be that fetching all the data at once is simply too big to fit in memory?
That is perfectly possible in Iris, as it uses deferred loading: any reference to "cube.data" will fetch all the data into a real in-memory array, whereas e.g. simply saving the cube or calculating a statistic can avoid that.
So, the usability of really big cubes critically depends on what you eventually do with the content.
Iris now has a much fuller account of this in the docs for the forthcoming version 2.0: https://scitools-docs.github.io/iris/master/userguide/real_and_lazy_data.html
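You can check which situation you're in directly on the cube (a minimal sketch, with a hypothetical filename):

import iris

cube = iris.load_cube('my_data.nc')   # hypothetical input file
print(cube.has_lazy_data())           # True: only metadata has been read so far

data = cube.data                      # touching .data realises the whole array
print(cube.has_lazy_data())           # now False: everything is in memory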
For instance, in the upcoming Iris v2 it will be possible to use Dask to do this efficiently. Something like:
import dask.array as da

data = cube.lazy_data()
data = da.where(data < threshold_value, 0.0, data)   # zero values below the threshold
zapped_cube = cube.copy(data=data)
This makes a derived cube with a deferred data calculation. As that can be processed in "chunks" when its time comes, it can drastically reduce the memory usage.
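For example, if the end goal is to write the result out, saving the derived cube should let the calculation stream through in chunks rather than being realised all at once (sketch, with a hypothetical output filename):

import iris

# Saving evaluates the deferred threshold calculation chunk-by-chunk
# as the data is written, so the full result never has to fit in memory.
iris.save(zapped_cube, 'thresholded.nc')   # hypothetical output file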