I have received the following code (below) from my supervisor for extracting daily data from CMIP5 netCDF files for multiple locations across multiple files. Unfortunately it is not working as it is supposed to, and I am not well versed enough in Python and the netCDF format to solve the problems I encounter.
Description of the problem: the CSVs that are created start with "1850-01-01" and end with "1858-12-31" instead of (for the file used in this case) "1970-01-01" and "1979-12-31". From what I can tell, the problem probably lies in this part of the code:
time = data.variables['time']
# saving the year which is written in the units attribute
year = time.units[11:15]
# append the year found in each file to the list
all_years.append(year)
# Creating an empty pandas DataFrame covering the whole range of data
year_start = min(all_years)
# Here you need to add the number of years of data you have.
end_year = str(int(min(all_years)) + 8)
date_range = pd.date_range(start=str(year_start) + '-01-01',
                           end=str(end_year) + '-12-31',
                           freq='D')
where the variable "year" gets its value from "time.units[11:15]". That slice is "1850" — the reference date from which the days are counted, not the actual start of the data. I don't know if this helps. I also read that the netCDF4 functions "num2date"/"date2num" can convert the content of the time variable, which stores the time elapsed since the reference date, into a "yyyy-mm-dd" format, but I don't know how to apply them or which variable holds this information. I also thought there might be another variable somewhere in the netCDF file which holds the real dates starting with "1970-01-01", which could then be written into the CSV. I would really appreciate it if you could help me with this one!
I've created a google drive folder which contains some test files and the python script: https://drive.google.com/drive/folders/1dZwGiLG3V-wFJ7XiT0sWykChmUZmEFD2?usp=sharing.
Best regards, Alexander
PS: Sorry for the mess, I don't know how to edit and format the text in the right way.
# this is for reading the .nc files in the working folder
import glob
# this is required to read the netCDF4 data
from netCDF4 import Dataset
# required to read and write the csv files
import pandas as pd
# required for the array functions
import numpy as np
# Record the year of each netCDF file into a Python list
all_years = []
for file in glob.glob('*.nc'):
    print(file)
    # reading the file
    data = Dataset(file, 'r')
    # saving the time variable
    time = data.variables['time']
    # saving the year which is written in the units attribute
    year = time.units[11:15]
    # append the year found in each file to the list
    all_years.append(year)
# Creating an empty pandas DataFrame covering the whole range of data;
# the required data will be read into it later
year_start = min(all_years)
# Here you need to add the number of years of data you have.
end_year = str(int(min(all_years)) + 8)
date_range = pd.date_range(start=str(year_start) + '-01-01',
                           end=str(end_year) + '-12-31',
                           freq='D')
# an empty DataFrame filled with 0.0, indexed by date_range,
# with one column for the precipitation values
df = pd.DataFrame(0.0, columns=['Precipitation'], index=date_range)
# The names, lat, lon of the locations of interest are defined in a csv file
locations = pd.read_csv('stations_locations.csv')
# loop over the rows to extract the information for each location one by one
for index, row in locations.iterrows():
    # extract the information from the csv into temporary variables
    location = row['names']
    location_lat = row['latitude']
    location_lon = row['longitude']
    # Sorting all_years just to be sure the data is written in order
    # all_years.sort()
    # now we read the netCDF file; here a file from the inmcm4 model is used
    # for yr in all_years:
    # Reading in the data: as there is only one file there is no need for a loop
    data = Dataset('pr_day_inmcm4_historical_r1i1p1_19700101-19791231.nc', 'r')
    # Storing the lat and lon data of the netCDF file into variables
    lat = data.variables['lat'][:]
    lon = data.variables['lon'][:]
    # as we already have the coordinates of the point of interest,
    # we subtract them from the file's coordinates and take the
    # grid point with the minimum (squared) distance
    sq_diff_lat = (lat - location_lat)**2
    sq_diff_lon = (lon - location_lon)**2
    # Identify the index of the minimum value for lat and lon
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()
    # Accessing the precipitation data
    temp = data.variables['pr']
    # Creating the date range for the file
    start = str(year_start) + '-01-01'
    end = str(end_year) + '-12-31'
    d_range = pd.date_range(start=start,
                            end=end,
                            freq='D')
    for t_index in np.arange(0, len(d_range)):
        print('Recording the value for: ' + str(location) + '_' + str(d_range[t_index]))
        df.loc[d_range[t_index], 'Precipitation'] = temp[t_index, min_index_lat, min_index_lon]
    df.to_csv(str(location) + '.csv')