Combining a large amount of netCDF files

Question

I have a large folder of netCDF (.nc) files each one with a similar name. The data files contain variables of time, longitude, latitude, and monthly precipitation. The goal is to get the average monthly precipitation over X amount of years for each month. So in the end I would have 12 values representing the average monthly precipitation over X amount of years for each lat and long. Each file is the same location over many years. Each file starts with the same name and ends in a “date.sub.nc” for example:

'data1.somthing.somthing1.avg_2d_Ind_Nx.200109.SUB.nc'
'data1.somthing.somthing1.avg_2d_Ind_Nx.200509.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201104.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201004.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201003.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201103.SUB.nc'
'data1.somthing.somthing1.avg_2d_Ind_Nx.201203.SUB.nc'

The ending is YearMonth.SUB.nc What I have so far is:

array=[]
f = nc.MFDataset('data*.nc')
precp = f.variables['prectot']
time = f.variables['time']
array = f.variables['time','longitude','latitude','prectot']

I get a KeyError: ('time', 'longitude', 'latitude', 'prectot'). Is there a way to combine all this data so I am able to manipulate it?

What do you mean by "combine" the data? It's already all in a single MFDataset object thanks to your `f = nc.MFDataset...` line. In other words, the `f.variables['prectot'][:]` array is already a 3-D array with dimensions (time, latitude, longitude) containing the `prectot` values for each (time, latitude, longitude). — Spencer Hill, Mar 10 '15 at 13:18
Also, re: your KeyError, `f.variables` is a Dict, and for any Dict you can only access one of its values at a time, i.e. `f.variables['time']` or `f.variables['longitude']`, not `f.variables['time', 'longitude']`. But as my previous comment said, all you need anyways is `f.variables['prectot']` (provided I'm understanding you correctly). — Spencer Hill, Mar 10 '15 at 13:29
I see, I was unsure on what MFDataset actually did. I tried glob.glob function but that just made a list of all my files. Thanks. — BBHuggin, Mar 10 '15 at 14:36

N1B4 · Accepted Answer · 2015-03-12T04:17:27.083

5

As @CharlieZender mentioned, ncra is the way to go here and I'll provide some more details on integrating that function into a Python script. (PS - you can install NCO easily with Homebrew, e.g. http://alejandrosoto.net/blog/2014/01/22/setting-up-my-mac-for-scientific-research/)

import subprocess
import netCDF4
import glob
import numpy as np

for month in range(1,13):
    # Gather all the files for this month
    month_files = glob.glob('/path/to/files/*{0:0>2d}.SUB.nc'.format(month))


    # Using NCO functions ---------------
    avg_file = './precip_avg_{0:0>2d}.nc'.format(month)

    # Concatenate the files using ncrcat
    subprocess.call(['ncrcat'] + month_files + ['-O', avg_file])

    # Take the time (record) average using ncra 
    subprocess.call(['ncra', avg_file, '-O', avg_file])

    # Read in the monthly precip climatology file and do whatever now
    ncfile = netCDF4.Dataset(avg_file, 'r')
    pr = ncfile.variables['prectot'][:,:,:]
    ....

    # Using only Python -------------
    # Initialize an array to store monthly-mean precip for all years
    # let's presume we know the lat and lon dimensions (nlat, nlon)
    nyears = len(month_files)
    pr_arr = np.zeros([nyears,nlat,nlon], dtype='f4')

    # Populate pr_arr with each file's monthly-mean precip
    for idx, filename in enumerate(month_files):
        ncfile = netCDF4.Dataset(filename, 'r')
        pr = ncfile.variable['prectot'][:,:,:]  
        pr_arr[idx,:,:] = np.mean(pr, axis=0)
        ncfile.close()

    # Take the average along all years for a monthly climatology
    pr_clim = np.mean(pr_arr, axis=0)  # 2D now [lat,lon]

edited Mar 12 '15 at 04:17

answered Mar 10 '15 at 18:03

N1B4

3,377
1
21
24

You can use pure Python to achieve the same result, NCO just greatly helps compress the time/memory demands. I'll edit my answer to show you how to use only Python. – N1B4 Mar 11 '15 at 23:36
Two things: One, the for loop "for month in range(1,13):" is only taking the December month files. When changing the range from 1,13 to 1,12 it only takes the November month files. Would a another for loop above the fist one take care of this issue? Second, once I get into the "populate pr_arr" for loop it does not like the pr_arr[idx,:,:] =np.mean(pr, axis=0) saying "ValueError: could not broadcast input array from shape (5,5) into shape (0,0)". Another question is what does "{0:0>2d}" actually do? – BBHuggin Mar 12 '15 at 23:31
1

Strange about the for loop not working, it does on my end. It should first pick out all January (month=1) files, then move on to the next month. Try printing information after the for loops to help diagnose the issue. What is the shape of `prectot` to begin with? Make sure you're inserting the dimension sizes of `nlat` and `nlon` into the initialization of pr_arr. '{0:0>2d} pads integers with a leading 0, e.g. 1 becomes 01, which is how the filenames are structured. Check out more info on string formatting for extra details on this. – N1B4 Mar 13 '15 at 00:23
I fixed the problem with the loop. I added "append" function so now it looks like this. month_files.append(glob.glob(*{0:0>2d}.SUB.nc'.format(month))). I saw when printing it would print all the file months 1 through 12 but seemed to only input the last set (December-12) into month_files. When you say assume the lat and lon does that mean I need to make an new array with the same lat and lons within my .nc files? prectot has a shape of (396L, 5L, 5L). – BBHuggin Mar 13 '15 at 00:39
You just need to insert 5 where `nlat` and `nlon` are during the line where `pr_arr` is defined. BTW, if the files are storing monthly-means to begin with, then the shape of `prectot` should be (1 x lat x lon)...I'm not sure where the 396 is coming from. In that case, you don't need to compute a monthly mean as I've shown above. Just populate `pr_arr[idx,:,;]` with the monthly-mean precip from each month's file, i.e. `pr_arr[idx,:,:] = pr[0,:,:]` – N1B4 Mar 13 '15 at 23:45
The 396 is 33 years of data. (ie 396/12 = 33). Line with [ncfile = netCDF4.Dataset(filename, 'r')] is having issue. Produces 'TypeError: expected string or Unicode object, list found'. – BBHuggin Mar 14 '15 at 21:26

score 3 · Answer 2 · answered Mar 10 '15 at 12:14

3

NCO does this with

ncra *.01.SUB.nc pcp_avg_01.nc
ncra *.02.SUB.nc pcp_avg_02.nc
...
ncra *.12.SUB.nc pcp_avg_12.nc
ncrcat pcp_avg_??.nc pcp_avg.nc

Of course the first twelve commands can be done with a Bash loop, reducing the total number of lines to less than five. If you prefer to script with python, you can check your answers with this. ncra docs here.

answered Mar 10 '15 at 12:14

Charlie Zender

5,929
14
19

I am using Enthought Canopy for python and I could not find or download the package NCO to get the ncra function. I would like to use it though. – BBHuggin Mar 10 '15 at 14:42
I have installed NCO I think. when I run ncra *01.SUB.nc pcp_avg_01.nc it is outputting {SyntaxError: invalid syntax ncra *01.SUB.nc pcp_avg_01.nc} with the Caret pointing to S in SUB.nc not sure how to fix this. – BBHuggin Mar 16 '15 at 19:04
You are running on Windows so shell globbing does not work. Replace *01.SUB.nc by the actual list of files to input. The manual can be helpful. See http://nco.sourceforge.net/nco.html#Specifying-Input-Files – Charlie Zender Mar 17 '15 at 20:12

ClimateUnboxed · Answer 3 · 2017-09-08T13:55:43.340

The command ymonmean calculates the mean of calendar months in CDO. Thus the task can be accomplished in two lines:

cdo mergetime data*.SUB.nc  merged.nc  # put files together into one series
cdo ymonmean merged.nc annual_cycle.nc # mean of all Jan,Feb etc.

cdo can also calculate the annual cycle of other statistics, ymonstd, ymonmax etc... and the time units can be days or pentads as well as months. (e.g. ydaymean).

Combining a large amount of netCDF files

3 Answers3

Linked