This question concerns extracting data from NetCDF files (*.nc) to CSV, so that researchers can use it in other applications, e.g. statistical analysis of time series that are not stored as array-like data. There are two alternatives for beginners in Python, namely:
Script 1:
I need your help extracting climate data from an .nc file to CSV using Python. The data provider, Copernicus, gives some advice on how to extract data from .nc to CSV using a simple Python script. However, I run into OverflowError: int too big to convert.
I will briefly describe the steps and provide all necessary info regarding data shape and content.
# glob finds the .nc files in the working folder
import glob
# required to read the netCDF4 data
from netCDF4 import Dataset
# required to read and write the CSV files
import pandas as pd
# required for the array functions
import numpy as np
from matplotlib.dates import num2date
data = Dataset('prAdjust_tmean.nc')
And data looks like this:
print(data)
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
CDI: Climate Data Interface version 1.8.2 (http://mpimet.mpg.de/cdi)
frequency: year
CDO: Climate Data Operators version 1.8.2 (http://mpimet.mpg.de/cdo)
creation_date: 2020-02-12T15:00:49ZCET+0100
Conventions: CF-1.6
institution_url: www.smhi.se
invar_platform_id: -
invar_rcm_model_driver: MPI-M-MPI-ESM-LR
time_coverage_start: 1971
time_coverage_end: 2000
domain: EUR-11
geospatial_lat_min: 23.942343
geospatial_lat_max: 72.641624
geospatial_lat_resolution: 0.04268074 degree
geospatial_lon_min: -35.034023
geospatial_lon_max: 73.937675
geospatial_lon_resolution: 0.009246826 degree
geospatial_bounds: -
NCO: netCDF Operators version 4.7.7 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)
acknowledgements: This work was performed within Copernicus Climate Change Service - C3S_424_SMHI, https://climate.copernicus.eu/operational-service-water-sector, on behalf of ECMWF and EU.
contact: Hydro.fou@smhi.se
keywords: precipitation
license: Copernicus License V1.2
output_frequency: 30 year average value
summary: Calculated as the mean annual values of daily precipitation averaged over a 30 year period.
comment: The Climate Data Operators (CDO) software was used for the calculation of climate impact indicators (https://code.mpimet.mpg.de/projects/cdo/embedded/cdo.pdf, https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_eca.pdf).
history: CDO commands (last cdo command first and separated with ;): timmean; yearmean
invar_bc_institution: Swedish Meteorological and Hydrological Institute
invar_bc_method: TimescaleBC, Description in deliverable C3S_D424.SMHI.1.3b
invar_bc_method_id: TimescaleBC v1.02
invar_bc_observation: EFAS-Meteo, https://ec.europa.eu/jrc/en/publication/eur-scientific-and-technical-research-reports/efas-meteo-european-daily-high-resolution-gridded-meteorological-data-set-1990-2011
invar_bc_observation_id: EFAS-Meteo
invar_bc_period: 1990-2018
data_quality: Testing of EURO-CORDEX data performed by ESGF nodes. Additional tests were performed when producing CII and ECVs in C3S_424_SMHI.
institution: SMHI
project_id: C3S_424_SMHI
references:
source: The RCM data originate from EURO-CORDEX (Coordinated Downscaling Experiment - European Domain, EUR-11) https://euro-cordex.net/.
invar_experiment_id: rcp45
invar_realisation_id: r1i1p1
invar_rcm_model_id: MPI-CSC-REMO2009-v1
variable_name: prAdjust_tmean
dimensions(sizes): x(1000), y(950), time(1), bnds(2)
variables(dimensions): float32 lon(y,x), float32 lat(y,x), float64 time(time), float64 time_bnds(time,bnds), float32 prAdjust_tmean(time,y,x)
groups:
Extract the variable:
t2m = data.variables['prAdjust_tmean']
# Get dimensions assuming 3D: time, latitude, longitude
time_dim, lat_dim, lon_dim = t2m.get_dims()
time_var = data.variables[time_dim.name]
times = num2date(time_var[:], time_var.units)
latitudes = data.variables[lat_dim.name][:]
longitudes = data.variables[lon_dim.name][:]
output_dir = './'
And the error:
OverflowError Traceback (most recent call last)
<ipython-input-9-69e10e41e621> in <module>
2 time_dim, lat_dim, lon_dim = t2m.get_dims()
3 time_var = data.variables[time_dim.name]
----> 4 times = num2date(time_var[:], time_var.units)
5 latitudes = data.variables[lat_dim.name][:]
6 longitudes = data.variables[lon_dim.name][:]
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in num2date(x, tz)
509 if tz is None:
510 tz = _get_rc_timezone()
--> 511 return _from_ordinalf_np_vectorized(x, tz).tolist()
512
513
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in __call__(self, *args, **kwargs)
2106 vargs.extend([kwargs[_n] for _n in names])
2107
-> 2108 return self._vectorize_call(func=func, args=vargs)
2109
2110 def _get_ufunc_and_otypes(self, func, args):
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in _vectorize_call(self, func, args)
2190 for a in args]
2191
-> 2192 outputs = ufunc(*inputs)
2193
2194 if ufunc.nout == 1:
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in _from_ordinalf(x, tz)
329
330 dt = (np.datetime64(get_epoch()) +
--> 331 np.timedelta64(int(np.round(x * MUSECONDS_PER_DAY)), 'us'))
332 if dt < np.datetime64('0001-01-01') or dt >= np.datetime64('10000-01-01'):
333 raise ValueError(f'Date ordinal {x} converts to {dt} (using '
OverflowError: int too big to convert
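A likely cause of this error: matplotlib's num2date expects matplotlib date ordinals (days since matplotlib's own epoch), while a NetCDF time variable counts from the epoch given in its units attribute, so feeding raw CF time values to matplotlib can overflow when it multiplies them up to microseconds. netCDF4 ships its own num2date that reads the units string (netCDF4.num2date(time_var[:], time_var.units)), which is usually the right tool here. As a stdlib-only illustration of what CF time decoding does, here is a minimal sketch, assuming the common 'days since YYYY-MM-DD' units form (the epoch below is made up for the example):

```python
from datetime import datetime, timedelta

def decode_cf_days(values, units):
    # Parse a units string like "days since 1951-01-01 00:00:00"
    # (only the "days since" form is handled in this sketch).
    prefix = "days since "
    assert units.startswith(prefix), "sketch only handles 'days since' units"
    epoch = datetime.fromisoformat(units[len(prefix):].strip()[:19])
    # Offset each numeric time value from the epoch
    return [epoch + timedelta(days=float(v)) for v in values]

# 7305 days after 1951-01-01 is 1971-01-01 (20 years incl. 5 leap days)
print(decode_cf_days([7305.0], "days since 1951-01-01"))
```

The real netCDF4.num2date also handles other intervals ("seconds since", "hours since") and non-standard calendars, which is why it is preferable to hand-rolled parsing.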
The last part of the script is:
import os

# Joining path components: relative parts are appended in order ...
path = "/home"
print(os.path.join(path, "User/Desktop", "file.txt"))

# ... but an absolute component ("/home" here) discards everything before it
path = "User/Documents"
print(os.path.join(path, "/home", "file.txt"))

filename = os.path.join(output_dir, 'table.csv')
print(f'Writing data in tabular form to {filename} (this may take some time)...')
times_grid, latitudes_grid, longitudes_grid = [
    x.flatten() for x in np.meshgrid(times, latitudes, longitudes, indexing='ij')]

df = pd.DataFrame({
    'time': [t.isoformat() for t in times_grid],
    'latitude': latitudes_grid,
    'longitude': longitudes_grid,
    't2m': t2m[:].flatten()})

df.to_csv(filename, index=False)
print('Done')
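The meshgrid/flatten step above is what turns the (time, lat, lon) cube into one CSV row per grid cell. A self-contained toy illustration of that reshaping (toy coordinates, made up for the example; note it assumes 1-D latitude/longitude coordinate arrays, whereas this particular file stores 2-D lat(y,x)/lon(y,x) fields):

```python
import numpy as np
import pandas as pd

times = np.array([0, 1])                         # 2 toy time steps
lats = np.array([50.0, 51.0])                    # 2 toy latitudes
lons = np.array([10.0, 11.0, 12.0])              # 3 toy longitudes
values = np.arange(2 * 2 * 3).reshape(2, 2, 3)   # shape (time, lat, lon)

# indexing='ij' keeps axis order (time, lat, lon); flattening all three
# coordinate grids plus the data gives aligned 1-D columns
t_grid, lat_grid, lon_grid = [
    a.flatten() for a in np.meshgrid(times, lats, lons, indexing='ij')]
df = pd.DataFrame({'time': t_grid, 'latitude': lat_grid,
                   'longitude': lon_grid, 'value': values.flatten()})

print(len(df))                 # 12 rows: one per (time, lat, lon) cell
print(df['value'].tolist())    # [0, 1, 2, ..., 11]
```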
Script 2:
The second script has a more useful approach, because it lets the user select the exact data from the NetCDF file in Python, rather than extracting all of it. This script was written by GeoDelta Labs on YouTube. However, as is often the case, the script is "written" for a specific format of .nc files and needs further tweaking before it can read any .nc file.
The code is like this:
import netCDF4
import glob
from netCDF4 import Dataset
import pandas as pd
import numpy as np
# Record all the years of the netCDF files into a Python list
all_years = []

for file in glob.glob('name_of_the_file.nc'):
    print(file)
    data = Dataset(file, 'r')
    data.variables.keys()
    print(data.__dict__)

for file in glob.glob('name_of_the_file.nc'):
    print(file)
    data = Dataset(file, 'r')
    time = data.variables['time']
    year = time.units[14:18]
    all_years.append(year)

all_years
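One fragile spot above: time.units[14:18] hard-codes where the year sits in the units string, so it only works when the "<interval> since " prefix happens to be 14 characters long. Incidentally, for units like "seconds since 1970-01-01 ..." the slice returns "1970" for every file, which would match the symptom described below (only 1970 is found) and would also explain the OverflowError in script 1, since matplotlib would treat a huge seconds count as a day count. A more robust sketch pulls the first four-digit year out with a regular expression:

```python
import re

def start_year(units):
    # Extract the first 4-digit year from a CF units string such as
    # "days since 1949-12-01 00:00:00" or "seconds since 1970-01-01".
    # Unlike units[14:18], this does not depend on the length of the
    # "<interval> since " prefix.
    match = re.search(r"\b(\d{4})\b", units)
    if match is None:
        raise ValueError(f"no year found in units: {units!r}")
    return int(match.group(1))

print(start_year("days since 1949-12-01 00:00:00"))  # 1949
print(start_year("seconds since 1970-01-01"))        # 1970
```

Note that either way this gives the year of the reference epoch, not necessarily the years actually covered by the data; decoding the time values themselves (e.g. with netCDF4.num2date) is the reliable route.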
# Creating an empty Pandas DataFrame covering the whole range of data
year_start = min(all_years)
end_year = max(all_years)
date_range = pd.date_range(start=str(year_start) + '-01-01',
                           end=str(end_year) + '-12-31',
                           freq='Y')
df = pd.DataFrame(0.0, columns=['Temperature'], index=date_range)
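Worth noting: freq='Y' produces year-end timestamps, so the DataFrame above is indexed by Dec-31 dates only, one per year. That matters later, because writes keyed by daily dates will not match this yearly index. A quick check of what the index actually contains:

```python
import pandas as pd

# freq='Y' yields one year-end stamp (Dec 31) per year in the range
idx = pd.date_range(start='1971-01-01', end='1973-12-31', freq='Y')
print([d.strftime('%Y-%m-%d') for d in idx])
# ['1971-12-31', '1972-12-31', '1973-12-31']
```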
Now the interesting part of this second Python script. NetCDF files usually cover vast geographical areas (for example, the Northern Hemisphere or entire continents), which is not so useful for a particular statistical analysis. Say the user wants to extract climate data from .nc files for particular locations: countries, regions, cities, etc. GeoDelta Labs comes with a very useful approach. It requires a small CSV file, created separately by the user, with three columns: Name (name of the region/country/location), Longitude and Latitude.
# Defining the location, lat, lon based on the csv data
cities = pd.read_csv('countries.csv')
cities
all_years

for index, row in cities.iterrows():
    location = row['names']
    location_latitude = row['lat']
    location_longitude = row['long']

    # Sorting the all_years python list
    all_years.sort()

    for yr in all_years:
        # Reading-in the data
        data = Dataset('prAdjust_tmean_abs_QM-EFAS-Meteo-EUR-11_MPI-M-MPI-ESM-LR_historical_r1i1p1_MPI-CSC-REMO2009-v1_na_1971-2000_grid5km_v1.nc', 'r')

        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]

        # Squared difference between the specified lat,lon and the lat,lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2

        # Identify the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()

        # Accessing the average temperature data
        temp = data.variables['prAdjust_tmean']

        # Creating the date range for each year during each iteration
        start = str(yr) + '-01-01'
        end = str(yr) + '-12-31'
        d_range = pd.date_range(start=start, end=end, freq='D')

        for t_index in np.arange(0, len(d_range)):
            print('Recording the value for ' + location + ': ' + str(d_range[t_index]))
            # The original used df.loc[...]['Temparature'], which misspells
            # the 'Temperature' column and assigns to a copy; a single .loc
            # with (row, column) assigns in place
            df.loc[d_range[t_index], 'Temperature'] = temp[t_index, min_index_lat, min_index_lon]

    df.to_csv(location + '.csv')
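One more caveat for this particular file: its lat and lon variables are 2-D fields (lat(y,x) and lon(y,x), per the header shown in script 1), so taking separate argmins of sq_diff_lat and sq_diff_lon returns flat indices into 2-D arrays rather than a (y, x) pair, and temp[t_index, min_index_lat, min_index_lon] can then point at the wrong cell. A sketch of nearest-cell lookup on such a curvilinear grid, using toy coordinates made up for illustration:

```python
import numpy as np

def nearest_cell(lat2d, lon2d, target_lat, target_lon):
    # For curvilinear grids where lat and lon are both 2-D (y, x),
    # minimise the combined squared difference and unravel the flat
    # argmin back into a (y, x) index pair. (Squared degrees are only
    # a rough proximity measure, but fine for picking a nearby cell.)
    dist_sq = (lat2d - target_lat) ** 2 + (lon2d - target_lon) ** 2
    return np.unravel_index(dist_sq.argmin(), dist_sq.shape)

# Toy 2x3 grid: each entry gives the lat/lon of one cell
lat2d = np.array([[50.0, 50.0, 50.0], [51.0, 51.0, 51.0]])
lon2d = np.array([[10.0, 11.0, 12.0], [10.0, 11.0, 12.0]])
print(nearest_cell(lat2d, lon2d, 50.9, 11.1))  # (1, 1)
```

The returned (y, x) pair would then index the data as temp[t_index, y, x].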
The last part combines the data from the user-defined CSV (with exact locations) with the data extracted from the .nc file, creating a new CSV with exactly the data the user needs. However, as always, the second script fails to "read" the years from my .nc file and finds only the starting year 1970. Can someone explain what I am doing wrong? Could it be that the .nc file contains only data for 1970? The description of the data clearly states that the file contains annual averages from 1970 until 2000.
If someone can help me solve this issue, I guess this script can be useful for all future users of NetCDF files. Many thanks for your time.
Edit: @Robert Davy requested sample data. I've added a link to Google Drive, where the original .nc files are stored and can be downloaded/viewed by anyone with the link: https://drive.google.com/drive/folders/11q9dKCoQWqPE5EMcM5ye1bNPqELGMzOA?usp=sharing