14

I am trying to work with data from very large netCDF files (~400 GB each). Each file has a few variables, all much larger than the system memory (e.g. 180 GB vs. 32 GB of RAM). I am trying to use numpy and netCDF4-python to do some operations on these variables by copying one slice at a time and operating on that slice. Unfortunately, it is taking a really long time just to read each slice, which is killing the performance.

For example, one of the variables is an array of shape (500, 500, 450, 300). I want to operate on the slice [:,:,0], so I do the following:

import netCDF4 as nc

f = nc.Dataset('myfile.ncdf','r+')
myvar = f.variables['myvar']   # just a handle to the (500, 500, 450, 300) variable, no data read yet
myslice = myvar[:,:,0]         # this read of a (500, 500, 300) slice is the slow step

But the last step takes a really long time (~5 min on my system). If, for example, I save a variable of shape (500, 500, 300) in the netCDF file, then a read operation of the same size takes only a few seconds.
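
A quick way to reproduce that comparison (a sketch only; 'myvar_small' stands for the hypothetical (500, 500, 300) variable written to the same file):

import time
import netCDF4 as nc

f = nc.Dataset('myfile.ncdf', 'r')

t0 = time.time()
big_slice = f.variables['myvar'][:, :, 0]      # (500, 500, 300) slice of the 4-D variable
print('slice of 4-D variable: %.1f s' % (time.time() - t0))

t0 = time.time()
small = f.variables['myvar_small'][:]          # hypothetical 3-D variable of the same size
print('full 3-D variable:     %.1f s' % (time.time() - t0))

f.close()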

Is there any way I can speed this up? An obvious path would be to transpose the array so that the indices I am selecting come first. But in such a large file this cannot be done in memory, and attempting it seems even slower given that a simple read already takes a long time. What I would like is a quick way to read a slice of a netCDF file, in the fashion of the Fortran interface's get_vara function, or some way of efficiently transposing the array.

tiago
    If you want to do more with the data than just transposing it, have a look at the [`xarray`](http://xarray.pydata.org/en/stable/) module: It provides a very nice interface to [`dask`](http://dask.pydata.org/en/latest/) out-of-memory arrays. – j08lue Apr 25 '16 at 06:32
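
A minimal sketch of the xarray/dask approach mentioned in the comment above (the chunks argument requires dask to be installed; 'dim2' is a placeholder for the real name of the third axis, which ncdump -h would show):

import xarray as xr

ds = xr.open_dataset('myfile.ncdf', chunks={'dim2': 10})   # lazy, dask-backed arrays
myslice = ds['myvar'].isel(dim2=0).values                  # only this slice is read and computed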

2 Answers

8

You can transpose netCDF variables too large to fit in memory by using the nccopy utility, which is documented here:

http://www.unidata.ucar.edu/netcdf/docs/guide_nccopy.html

The idea is to "rechunk" the file by specifying what shapes of chunks (multidimensional tiles) you want for the variables. You can specify how much memory to use as a copy buffer and how much to use for chunk caches, but it's not clear how to divide memory optimally between these uses, so you may have to just try some examples and time them. Rather than completely transposing a variable, you probably want to "partially transpose" it by specifying chunks that hold a lot of data along the two big dimensions of your slice and only a few values along the other dimensions.
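
For example, a sketch of such an nccopy call (the dimension names x, y, z, t are placeholders for the real names that ncdump -h shows; the chunk lengths give roughly 10 chunks along each dimension of the (500, 500, 450, 300) variable, with 1 GB each for the copy buffer and the chunk cache):

nccopy -c x/50,y/50,z/45,t/30 -m 1G -h 1G myfile.ncdf myfile_rechunked.ncdf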

Russ Rew
  • Thanks Russ for your answer. It was very interesting, as I had never looked much into chunking. Assuming I have a variable with dimensions (500, 500, 300, 400): if I set a chunk length of 1 on the third dimension, is this analogous to a partial transpose where that axis is the fastest (i.e., contiguous) one? I did change the chunking on the axis that I read most, but it still takes a really long time just to get a 3D slice. I will investigate whether this is a filesystem/network issue. – tiago Aug 23 '12 at 07:17
  • No, making the chunk length in the 3rd dimension 1 makes that dimension the slowest, as you would access a 400 MB chunk for each 4-byte value when reading along that dimension. But if you used 10 chunks along each dimension (each chunk 50x50x30x40), each chunk would comprise about 12 MB (assuming 4 bytes per value), and it would take only 10 reads to access a "cylinder" of values along any dimension (a 50x50x30x40 chunk). For an example of how this can improve access times in some directions, see these 2 slides: http://www.unidata.ucar.edu/netcdf/workshops/2011/chunk_cache/Problem.html – Russ Rew Aug 27 '12 at 15:04
  • Correction to above comment: replace "(a 50x50x30x40 chunk)" with "(10 50x50x30x40 chunks)" ... – Russ Rew Aug 27 '12 at 15:40
  • I'm a bit confused. Assuming the (500, 500, 300, 400) variable size, I want fast access to slices like (:, :, 0, 0). I thought that chunking with 1 in the last two dimensions would be the best thing (other than transposing the whole thing). What is the best chunking for that kind of access? Your link says that rechunking with large values for the first dimension and smaller values for the last dimensions speeds up access along those last dimensions, but you seem to be saying the opposite. – tiago Aug 27 '12 at 17:33
3

This is a comment, not an answer, but I can't comment on the above, sorry.

I understand that you want to process myvar[:,:,i], with i in range(450). In that case, you are going to do something like:

for i in range(450):
    myslice = myvar[:,:,i]   # one disk read per iteration
    do_something(myslice)

and the bottleneck is in accessing myslice = myvar[:,:,i]. Have you tried comparing how long it takes to access moreslices = myvar[:,:,0:n]? It would be contiguous data, and maybe you can save time with that. You would choose n as large as your memory allows, process that block of data, and then move on to the next one, moreslices = myvar[:,:,n:2*n], and so on.
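
A minimal sketch of that blocked loop (do_something and the block size n are placeholders; with float32 data, a (500, 500, 10, 300) block is about 3 GB):

import netCDF4 as nc

def do_something(arr):                        # placeholder for the real per-slice work
    return arr.mean()

f = nc.Dataset('myfile.ncdf', 'r')
myvar = f.variables['myvar']                  # shape (500, 500, 450, 300)

n = 10                                        # block size along the third axis
for start in range(0, myvar.shape[2], n):
    stop = min(start + n, myvar.shape[2])
    moreslices = myvar[:, :, start:stop]      # one disk read covers n slices
    for i in range(stop - start):
        do_something(moreslices[:, :, i])     # in-memory slicing, no further I/O
f.close()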

gg349
  • Thank you for your answer. I have compared accessing `myvar[:,:,0:n]` and it does take about the same time as `myvar[:,:,0]`. So this is at least a way, but I am still trying to find out why there is such a penalty to start with. Note that `myvar[:,:,0:n]` is not contiguous. – tiago Aug 22 '12 at 23:56
  • Well, it is true that `myvar[1,0,0]` is not contiguous to `myvar[2,0,0]`. But it takes about the same time because `myvar[i,i,0]` is actually contiguous to `myvar[i,i,1]`. Does it make more sense now? – gg349 Aug 23 '12 at 08:58