This question concerns extracting data from NetCDF files (*.nc) to CSV, so that researchers can use it in other applications, e.g. statistical analysis of time series that are not stored as array-like data. There are two alternatives for beginners in Python, namely:
Script 1:
I need your help extracting climate data from an .nc file to CSV using Python. The data provider, Copernicus, gives some advice on how to extract data from .nc to CSV using a simple Python script. However, I run into OverflowError: int too big to convert.
I will briefly describe the steps and provide all necessary info regarding data shape and content.
# glob finds the .nc files in the working folder
import glob
# required to read the netCDF4 data
from netCDF4 import Dataset
# required to read and write the CSV files
import pandas as pd
# required for the array functions
import numpy as np
from matplotlib.dates import num2date
data = Dataset('prAdjust_tmean.nc')
And data looks like this:
print(data)
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
CDI: Climate Data Interface version 1.8.2 (http://mpimet.mpg.de/cdi)
frequency: year
CDO: Climate Data Operators version 1.8.2 (http://mpimet.mpg.de/cdo)
creation_date: 2020-02-12T15:00:49ZCET+0100
Conventions: CF-1.6
institution_url: www.smhi.se
invar_platform_id: -
invar_rcm_model_driver: MPI-M-MPI-ESM-LR
time_coverage_start: 1971
time_coverage_end: 2000
domain: EUR-11
geospatial_lat_min: 23.942343
geospatial_lat_max: 72.641624
geospatial_lat_resolution: 0.04268074 degree
geospatial_lon_min: -35.034023
geospatial_lon_max: 73.937675
geospatial_lon_resolution: 0.009246826 degree
geospatial_bounds: -
NCO: netCDF Operators version 4.7.7 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)
acknowledgements: This work was performed within Copernicus Climate Change Service - C3S_424_SMHI, https://climate.copernicus.eu/operational-service-water-sector, on behalf of ECMWF and EU.
contact: Hydro.fou@smhi.se
keywords: precipitation
license: Copernicus License V1.2
output_frequency: 30 year average value
summary: Calculated as the mean annual values of daily precipitation averaged over a 30 year period.
comment: The Climate Data Operators (CDO) software was used for the calculation of climate impact indicators (https://code.mpimet.mpg.de/projects/cdo/embedded/cdo.pdf, https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_eca.pdf).
history: CDO commands (last cdo command first and separated with ;): timmean; yearmean
invar_bc_institution: Swedish Meteorological and Hydrological Institute
invar_bc_method: TimescaleBC, Description in deliverable C3S_D424.SMHI.1.3b
invar_bc_method_id: TimescaleBC v1.02
invar_bc_observation: EFAS-Meteo, https://ec.europa.eu/jrc/en/publication/eur-scientific-and-technical-research-reports/efas-meteo-european-daily-high-resolution-gridded-meteorological-data-set-1990-2011
invar_bc_observation_id: EFAS-Meteo
invar_bc_period: 1990-2018
data_quality: Testing of EURO-CORDEX data performed by ESGF nodes. Additional tests were performed when producing CII and ECVs in C3S_424_SMHI.
institution: SMHI
project_id: C3S_424_SMHI
references:
source: The RCM data originate from EURO-CORDEX (Coordinated Downscaling Experiment - European Domain, EUR-11) https://euro-cordex.net/.
invar_experiment_id: rcp45
invar_realisation_id: r1i1p1
invar_rcm_model_id: MPI-CSC-REMO2009-v1
variable_name: prAdjust_tmean
dimensions(sizes): x(1000), y(950), time(1), bnds(2)
variables(dimensions): float32 lon(y,x), float32 lat(y,x), float64 time(time), float64 time_bnds(time,bnds), float32 prAdjust_tmean(time,y,x)
groups:
Extract the variable:
t2m = data.variables['prAdjust_tmean']
# Get dimensions assuming 3D: time, latitude, longitude
time_dim, lat_dim, lon_dim = t2m.get_dims()
time_var = data.variables[time_dim.name]
times = num2date(time_var[:], time_var.units)
latitudes = data.variables[lat_dim.name][:]
longitudes = data.variables[lon_dim.name][:]
output_dir = './'
And the error:
OverflowError Traceback (most recent call last)
<ipython-input-9-69e10e41e621> in <module>
2 time_dim, lat_dim, lon_dim = t2m.get_dims()
3 time_var = data.variables[time_dim.name]
----> 4 times = num2date(time_var[:], time_var.units)
5 latitudes = data.variables[lat_dim.name][:]
6 longitudes = data.variables[lon_dim.name][:]
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in num2date(x, tz)
509 if tz is None:
510 tz = _get_rc_timezone()
--> 511 return _from_ordinalf_np_vectorized(x, tz).tolist()
512
513
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in __call__(self, *args, **kwargs)
2106 vargs.extend([kwargs[_n] for _n in names])
2107
-> 2108 return self._vectorize_call(func=func, args=vargs)
2109
2110 def _get_ufunc_and_otypes(self, func, args):
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in _vectorize_call(self, func, args)
2190 for a in args]
2191
-> 2192 outputs = ufunc(*inputs)
2193
2194 if ufunc.nout == 1:
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in _from_ordinalf(x, tz)
329
330 dt = (np.datetime64(get_epoch()) +
--> 331 np.timedelta64(int(np.round(x * MUSECONDS_PER_DAY)), 'us'))
332 if dt < np.datetime64('0001-01-01') or dt >= np.datetime64('10000-01-01'):
333 raise ValueError(f'Date ordinal {x} converts to {dt} (using '
OverflowError: int too big to convert
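A likely cause of this error: matplotlib's num2date expects matplotlib date ordinals (days since matplotlib's own epoch), while a NetCDF time variable counts from the epoch given in its units attribute, so feeding raw CF time values to matplotlib can overflow when it multiplies them up to microseconds. netCDF4 ships its own num2date that reads the units string (netCDF4.num2date(time_var[:], time_var.units)), which is usually the right tool here. As a stdlib-only illustration of what CF time decoding does, here is a minimal sketch, assuming the common 'days since YYYY-MM-DD' units form (the epoch below is made up for the example):

```python
from datetime import datetime, timedelta

def decode_cf_days(values, units):
    # Parse a units string like "days since 1951-01-01 00:00:00"
    # (only the "days since" form is handled in this sketch).
    prefix = "days since "
    assert units.startswith(prefix), "sketch only handles 'days since' units"
    epoch = datetime.fromisoformat(units[len(prefix):].strip()[:19])
    # Offset each numeric time value from the epoch
    return [epoch + timedelta(days=float(v)) for v in values]

# 7305 days after 1951-01-01 is 1971-01-01 (20 years incl. 5 leap days)
print(decode_cf_days([7305.0], "days since 1951-01-01"))
```

The real netCDF4.num2date also handles other intervals ("seconds since", "hours since") and non-standard calendars, which is why it is preferable to hand-rolled parsing.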
The last part of the script is:
import os

# Joining path components: relative parts are appended in order ...
path = "/home"
print(os.path.join(path, "User/Desktop", "file.txt"))

# ... but an absolute component ("/home" here) discards everything before it
path = "User/Documents"
print(os.path.join(path, "/home", "file.txt"))

filename = os.path.join(output_dir, 'table.csv')
print(f'Writing data in tabular form to {filename} (this may take some time)...')
times_grid, latitudes_grid, longitudes_grid = [
    x.flatten() for x in np.meshgrid(times, latitudes, longitudes, indexing='ij')]

df = pd.DataFrame({
    'time': [t.isoformat() for t in times_grid],
    'latitude': latitudes_grid,
    'longitude': longitudes_grid,
    't2m': t2m[:].flatten()})

df.to_csv(filename, index=False)
print('Done')
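The meshgrid/flatten step above is what turns the (time, lat, lon) cube into one CSV row per grid cell. A self-contained toy illustration of that reshaping (toy coordinates, made up for the example; note it assumes 1-D latitude/longitude coordinate arrays, whereas this particular file stores 2-D lat(y,x)/lon(y,x) fields):

```python
import numpy as np
import pandas as pd

times = np.array([0, 1])                         # 2 toy time steps
lats = np.array([50.0, 51.0])                    # 2 toy latitudes
lons = np.array([10.0, 11.0, 12.0])              # 3 toy longitudes
values = np.arange(2 * 2 * 3).reshape(2, 2, 3)   # shape (time, lat, lon)

# indexing='ij' keeps axis order (time, lat, lon); flattening all three
# coordinate grids plus the data gives aligned 1-D columns
t_grid, lat_grid, lon_grid = [
    a.flatten() for a in np.meshgrid(times, lats, lons, indexing='ij')]
df = pd.DataFrame({'time': t_grid, 'latitude': lat_grid,
                   'longitude': lon_grid, 'value': values.flatten()})

print(len(df))                 # 12 rows: one per (time, lat, lon) cell
print(df['value'].tolist())    # [0, 1, 2, ..., 11]
```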
Script 2:
The second script has a more useful approach, because it lets the user select the exact data from the NetCDF file in Python, rather than extracting all of it. This script was written by GeoDelta Labs on YouTube. However, as is often the case, the script is "written" for a specific format of .nc files and needs further tweaking before it can read any .nc file.
The code is like this:
import netCDF4
import glob
from netCDF4 import Dataset
import pandas as pd
import numpy as np
# Record all the years of the netCDF files into a Python list
all_years = []

for file in glob.glob('name_of_the_file.nc'):
    print(file)
    data = Dataset(file, 'r')
    data.variables.keys()
    print(data.__dict__)

for file in glob.glob('name_of_the_file.nc'):
    print(file)
    data = Dataset(file, 'r')
    time = data.variables['time']
    year = time.units[14:18]
    all_years.append(year)

all_years
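One fragile spot above: time.units[14:18] hard-codes where the year sits in the units string, so it only works when the "<interval> since " prefix happens to be 14 characters long. Incidentally, for units like "seconds since 1970-01-01 ..." the slice returns "1970" for every file, which would match the symptom described below (only 1970 is found) and would also explain the OverflowError in script 1, since matplotlib would treat a huge seconds count as a day count. A more robust sketch pulls the first four-digit year out with a regular expression:

```python
import re

def start_year(units):
    # Extract the first 4-digit year from a CF units string such as
    # "days since 1949-12-01 00:00:00" or "seconds since 1970-01-01".
    # Unlike units[14:18], this does not depend on the length of the
    # "<interval> since " prefix.
    match = re.search(r"\b(\d{4})\b", units)
    if match is None:
        raise ValueError(f"no year found in units: {units!r}")
    return int(match.group(1))

print(start_year("days since 1949-12-01 00:00:00"))  # 1949
print(start_year("seconds since 1970-01-01"))        # 1970
```

Note that either way this gives the year of the reference epoch, not necessarily the years actually covered by the data; decoding the time values themselves (e.g. with netCDF4.num2date) is the reliable route.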
# Creating an empty Pandas DataFrame covering the whole range of data
year_start = min(all_years)
end_year = max(all_years)
date_range = pd.date_range(start=str(year_start) + '-01-01',
                           end=str(end_year) + '-12-31',
                           freq='Y')
df = pd.DataFrame(0.0, columns=['Temperature'], index=date_range)
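Worth noting: freq='Y' produces year-end timestamps, so the DataFrame above is indexed by Dec-31 dates only, one per year. That matters later, because writes keyed by daily dates will not match this yearly index. A quick check of what the index actually contains:

```python
import pandas as pd

# freq='Y' yields one year-end stamp (Dec 31) per year in the range
idx = pd.date_range(start='1971-01-01', end='1973-12-31', freq='Y')
print([d.strftime('%Y-%m-%d') for d in idx])
# ['1971-12-31', '1972-12-31', '1973-12-31']
```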
Now the interesting part of this second Python script. NetCDF files usually cover vast geographical areas (for example, the Northern Hemisphere or entire continents), which is not so useful for a particular statistical analysis. Say the user wants to extract climate data from .nc files for particular locations: countries, regions, cities, etc. GeoDelta Labs comes with a very useful approach. It requires a small CSV file, created separately by the user, with three columns: Name (name of the region/country/location), Longitude and Latitude.
# Defining the location, lat, lon based on the csv data
cities = pd.read_csv('countries.csv')
cities
all_years

for index, row in cities.iterrows():
    location = row['names']
    location_latitude = row['lat']
    location_longitude = row['long']

    # Sorting the all_years python list
    all_years.sort()

    for yr in all_years:
        # Reading-in the data
        data = Dataset('prAdjust_tmean_abs_QM-EFAS-Meteo-EUR-11_MPI-M-MPI-ESM-LR_historical_r1i1p1_MPI-CSC-REMO2009-v1_na_1971-2000_grid5km_v1.nc', 'r')

        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]

        # Squared difference between the specified lat,lon and the lat,lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2

        # Identify the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()

        # Accessing the average temperature data
        temp = data.variables['prAdjust_tmean']

        # Creating the date range for each year during each iteration
        start = str(yr) + '-01-01'
        end = str(yr) + '-12-31'
        d_range = pd.date_range(start=start, end=end, freq='D')

        for t_index in np.arange(0, len(d_range)):
            print('Recording the value for ' + location + ': ' + str(d_range[t_index]))
            # The original used df.loc[...]['Temparature'], which misspells
            # the 'Temperature' column and assigns to a copy; a single .loc
            # with (row, column) assigns in place
            df.loc[d_range[t_index], 'Temperature'] = temp[t_index, min_index_lat, min_index_lon]

    df.to_csv(location + '.csv')
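One more caveat for this particular file: its lat and lon variables are 2-D fields (lat(y,x) and lon(y,x), per the header shown in script 1), so taking separate argmins of sq_diff_lat and sq_diff_lon returns flat indices into 2-D arrays rather than a (y, x) pair, and temp[t_index, min_index_lat, min_index_lon] can then point at the wrong cell. A sketch of nearest-cell lookup on such a curvilinear grid, using toy coordinates made up for illustration:

```python
import numpy as np

def nearest_cell(lat2d, lon2d, target_lat, target_lon):
    # For curvilinear grids where lat and lon are both 2-D (y, x),
    # minimise the combined squared difference and unravel the flat
    # argmin back into a (y, x) index pair. (Squared degrees are only
    # a rough proximity measure, but fine for picking a nearby cell.)
    dist_sq = (lat2d - target_lat) ** 2 + (lon2d - target_lon) ** 2
    return np.unravel_index(dist_sq.argmin(), dist_sq.shape)

# Toy 2x3 grid: each entry gives the lat/lon of one cell
lat2d = np.array([[50.0, 50.0, 50.0], [51.0, 51.0, 51.0]])
lon2d = np.array([[10.0, 11.0, 12.0], [10.0, 11.0, 12.0]])
print(nearest_cell(lat2d, lon2d, 50.9, 11.1))  # (1, 1)
```

The returned (y, x) pair would then index the data as temp[t_index, y, x].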
The last part combines the data from the user-defined CSV (with exact locations) with the data extracted from the .nc file, creating a new CSV with exactly the data the user needs. However, as always, the second script fails to "read" the years from my .nc file and finds only the starting year 1970. Can someone explain what I am doing wrong? Could it be that the .nc file contains only data for 1970? The description of the data clearly states that the file contains annual averages from 1970 until 2000.
If someone can help me solve this issue, I guess this script can be useful for all future users of NetCDF files. Many thanks for your time.
Edit: @Robert Davy requested sample data. I've added a link to Google Drive, where the original .nc files are stored and can be downloaded/viewed by anyone with the link: https://drive.google.com/drive/folders/11q9dKCoQWqPE5EMcM5ye1bNPqELGMzOA?usp=sharing