
I have a working python program that reads in a number of large netCDF files using the Dataset command from the netCDF4 module. Here is a snippet of the relevant parts:

from netCDF4 import Dataset
import glob

infile_root = 'start_of_file_name_'

for infile in sorted(glob.iglob(infile_root + '*')):
    ncin = Dataset(infile, 'r')
    ncin.close()

I want to modify this to read in netCDF files that are gzipped. The files themselves were gzipped after creation; they are not internally compressed (i.e., the files are *.nc.gz). If I were reading in gzipped text files, the command would be:

from netCDF4 import Dataset
import glob
import gzip

infile_root = 'start_of_file_name_'

for infile in sorted(glob.iglob(infile_root + '*.gz')):
    f = gzip.open(infile, 'rb')
    file_content = f.read()
    f.close()

After googling around for maybe half an hour and reading through the netCDF4 documentation, the only way I can come up with to do this for netCDF files is:

from netCDF4 import Dataset
import glob
import os

infile_root = 'start_of_file_name_'

for infile in sorted(glob.iglob(infile_root + '*.gz')):
    os.system('gzip -d ' + infile)
    ncin = Dataset(infile[:-3], 'r')
    ncin.close()
    os.system('gzip ' + infile[:-3])

Is it possible to read gzip files with the Dataset command directly? Or without otherwise calling gzip through os?

eclark
  • The [Dataset docs](http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4.Dataset-class.html) don't specify anything about gz files, so I don't think it's supported. I'm sure there's a way of gunzipping them from within Python without a `gzip` system call, but I don't know what it would be. Why do you need it to be handled by Dataset? – Spencer Hill Dec 05 '14 at 18:29
  • I didn't want to make separate calls to unzip and then re-gzip the files. I also mostly want to avoid the system call. – eclark Dec 05 '14 at 18:34
  • `gzip.open` returns a file-like object that can be used for read-only access to the file - but it looks like netCDF4 doesn't support that. If it were me, I'd use the Python gzip module to unzip to a temporary file and leave the original alone. If it's something that needs to be done often, you could start maintaining a cache of the ones you've unzipped. If the files are modified, you'll need to unzip and rezip anyway, so what the heck. – tdelaney Dec 05 '14 at 18:43
  • @tdelaney - Thanks. Using the gzip module to unzip to temporary files is a great suggestion. – eclark Dec 05 '14 at 19:00

3 Answers

Reading datasets from in-memory buffers has been supported since netCDF4-python 1.2.8 (see the Changelog):

import netCDF4
import gzip

with gzip.open('test.nc.gz') as gz:
    with netCDF4.Dataset('dummy', mode='r', memory=gz.read()) as nc:
        print(nc.variables)

See the description of the `memory` parameter in the Dataset documentation.

sfinkens

Because NetCDF4-Python wraps the C NetCDF4 library, you're out of luck as far as passing in a file-like object from the gzip module. The only option is, as suggested by @tdelaney, to use the gzip module to extract the data to a temporary file.

If you happen to have any control over the creation of these files, NetCDF version 4 files support zlib compression internally, so that using gzip is superfluous. It might also be worth converting the files from version 3 to version 4 if you need to repeatedly process these files.

DopplerShift

Since I just had to solve the same problem, here is a ready-made solution:

import gzip
import os
import shutil
import tempfile

import netCDF4

def open_netcdf(fname):
    """Open a netCDF file, transparently decompressing *.gz files first."""
    if fname.endswith(".gz"):
        infile = gzip.open(fname, 'rb')
        # delete=False so netCDF4 can reopen the temporary file by name.
        tmp = tempfile.NamedTemporaryFile(delete=False)
        shutil.copyfileobj(infile, tmp)
        infile.close()
        tmp.close()
        data = netCDF4.Dataset(tmp.name)
        # On POSIX the name can be unlinked while the Dataset keeps the file
        # open; on Windows, defer this until after data.close().
        os.unlink(tmp.name)
    else:
        data = netCDF4.Dataset(fname)
    return data
jochen