
I have a netCDF file containing a float array of shape (21600, 43200). I don't want to read the entire array into RAM because it's too large, so I'm using the Dataset object from the netCDF4 library to read the array.

I would like to calculate the mean of a subset of this array using two 1D numpy arrays (x_coords, y_coords) of 300-400 coordinates.

I don't think I can use basic indexing, because the coordinates I have aren't contiguous. What I'm currently doing is just feeding the arrays directly into the object, like so:

import numpy as np
from netCDF4 import Dataset

ncdf_data = Dataset(file, 'r')
mean = np.mean(ncdf_data.variables['q'][x_coords, y_coords])

The above code takes far too long for my liking (~3-4 seconds depending on the coordinates I'm using), and I'd like to speed this up somehow. Is there a Pythonic way to work out the mean of such a subset directly, without triggering fancy indexing?

1 Answer


I know h5py warns about the slow speed of fancy indexing (docs.h5py.org/en/latest/high/dataset.html#fancy-indexing); netCDF4 probably has the same problem.

Can you load a contiguous slice that contains all the values, and apply the faster numpy advanced indexing to that subset in memory? Or you may have to work with chunks.
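For example, here is a minimal sketch of the bounding-box idea, assuming the coordinates cluster into a region small enough to hold in memory (the file name, the 'q' variable, and x_coords/y_coords mirror the question's setup):

import numpy as np
from netCDF4 import Dataset

ncdf_data = Dataset(file, 'r')
var = ncdf_data.variables['q']

# Read one contiguous block covering the bounding box of the coordinates
# (a single slice, so the file is read in one contiguous chunk).
x0, x1 = x_coords.min(), x_coords.max() + 1
y0, y1 = y_coords.min(), y_coords.max() + 1
block = var[x0:x1, y0:y1]  # now an in-memory numpy array

# Apply numpy advanced indexing to the in-memory block instead of the file.
mean = np.mean(block[x_coords - x0, y_coords - y0])

If the bounding box spans too much of the (21600, 43200) array to fit in RAM, see the chunked sketch further down.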

numpy advanced indexing is slower than its basic slicing, but it is still quite a bit faster than fancy indexing directly off the file.
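As for working with chunks (mentioned above), here is a rough sketch that accumulates the sum one band of rows at a time, so only a contiguous block is ever in memory. Again this assumes the question's file, 'q' variable, and coordinate arrays; chunk_rows is a made-up tuning parameter:

import numpy as np
from netCDF4 import Dataset

ncdf_data = Dataset(file, 'r')
var = ncdf_data.variables['q']

total = 0.0
count = 0
chunk_rows = 1000  # rows per read; tune to the RAM you can spare

for start in range(0, var.shape[0], chunk_rows):
    stop = start + chunk_rows
    # Which of the requested points fall inside this band of rows?
    in_band = (x_coords >= start) & (x_coords < stop)
    if not in_band.any():
        continue  # skip bands that contain none of the points
    band = var[start:stop, :]  # one contiguous read from the file
    vals = band[x_coords[in_band] - start, y_coords[in_band]]
    total += vals.sum()
    count += vals.size

mean = total / count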

However you do it, np.mean will be operating on data in memory, not directly on data in the file. Fancy indexing is slow because it has to access data scattered throughout the file. Loading the data into an array in memory isn't the slow part; the slow part is seeking and reading from the file.

Putting the file on a faster drive (e.g. a solid state one) might help.

hpaulj