Trouble with dimensions in netCDF: index exceeds dimension bounds

Question

I want to extract monthly temperature data for several locations from a set of netCDF files. The files are structured as follows:

> print(data.variables.keys())
dict_keys(['lon', 'lat', 'time', 'tmp','stn'])

The files have names like "tmp_1901_1910.nc".

Here is the code I use:

import glob
import os
import numpy as np
import pandas as pd
from netCDF4 import Dataset

 
os.chdir('PATH/data_tmp')
all_years = []

for file in glob.glob('*.nc'):
    data = Dataset(file,'r')
    time_data = data.variables['time'][:]
    time = data.variables['time']
    year =  str(file)[4:13]

    all_years.append(year)
   
# Empty pandas dataframe
year_start = min(all_years)
end_year = max(all_years)

date_range = pd.date_range(start = str(year_start[0:4]) + '-01-01', end = str(end_year[5:9]) + '-12-31', freq ='M')

df = pd.DataFrame(0.0, columns = ['Temp'], index = date_range)


# Defining the location, lat, lon based on the csv data 
cities = pd.read_csv(r'PATH/cities_coordinates.csv', sep =',')


cities['city']= cities['city'].map(str)


for index, row in cities.iterrows():
    location = row['code_nbs']
    location_latitude = row['lat']
    location_longitude = row['lon']
     
    # Sorting the list
    all_years.sort()
    
    for yr in all_years:
        #Reading in the data
        data = Dataset('tmp_'+str(yr)+'.nc','r')
        
        # Storing the lat and lon data into variables of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]
    
        # Squared difference between the specified lat, lon and the lat, lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2
        
        
        # Retrieving the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()
            
        # Accessing the temperature data
        tmp  = data.variables['tmp']
        
        start = str(yr[0:4])+'-01-01'
        end = str(yr[5:11])+'-12-31'
        d_range = pd.date_range(start = start, end = end, freq='M')
        
        for t_index in np.arange(0, len(d_range)):
             print('Recording the value for: '+str(d_range[t_index]))
             df.loc[d_range[t_index]]['Temp']=tmp[min_index_lon, min_index_lat, t_index]
           
    df.to_csv(location +'.csv')

I get the following error when running the line df.loc[d_range[t_index]]['Temp']=tmp[min_index_lon, min_index_lat, t_index]:

IndexError: index exceeds dimension bounds

Inspecting the values involved gives:

print(d_range)
DatetimeIndex(['1901-01-31', '1901-02-28', '1901-03-31', '1901-04-30',
               '1901-05-31', '1901-06-30', '1901-07-31', '1901-08-31',
               '1901-09-30', '1901-10-31',
               ...
               '1910-03-31', '1910-04-30', '1910-05-31', '1910-06-30',
               '1910-07-31', '1910-08-31', '1910-09-30', '1910-10-31',
               '1910-11-30', '1910-12-31'],
              dtype='datetime64[ns]', length=120, freq='M')

On the first t_index within the loop, I have:

print(t_index)
0

print(d_range[t_index])
1901-01-31 00:00:00

print(min_index_lat)
259
print(min_index_lon)
592

I don't understand what went wrong with the dimensions.

Thank you for any help!

It would help if you said what line was causing the error, and preferably made the problem reproducible. It seems impossible to provide an answer without more info — Robert Wilson, Sep 27 '22 at 16:10
This error occurs when you provide an index that is outside the allowed range of values in the file. Can you inspect the values of `d_range`, `d_range[t_index]`, and the indexers `min_index_lon, min_index_lat, t_index` to ensure they all fall within the bounds of `df` and `tmp`? If you think they do, can you print them out and copy the result into your question (as a code block), and also include the full traceback? It's always important to post the full traceback when asking questions about errors. Thanks! — Michael Delgado, Sep 27 '22 at 16:57
I recommend checking out xarray and regridding packages such as xesmf. Those packages offer very easy and efficient ways to handle situations like this — Robert Wilson, Sep 27 '22 at 21:51
Thank you for your comment, Robert. I am a beginner in Python and not sure I understand what you recommend. I do not use xarray; why should I check it out? — ele_al_12, Sep 28 '22 at 08:29
You do a lot of things while reading in the file(s). I suggest splitting it up: first import your 'raw' data into one or more DataFrames, and analyse it afterwards. I think your error occurs because the `d_range` you calculate and `data.variables['tmp']` have different indexes. — Andrew, Sep 28 '22 at 08:48

Andrew · Answer 1 · 2022-09-28T09:58:27.790

I assume you want to read in all the .nc data and map each location to its closest city. For that, I suggest reading all the data first and afterwards calculating which city each location belongs to. The following code will probably need some adaptation to your data, but it should show a direction that makes the code more robust.

Step 1: Import your 'raw' data

For example, into one or more DataFrames. Whether this works in one go depends on whether you can import all the data at once. If not, split steps 1 and 2 into chunks.

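import glob
import pandas as pd
from netCDF4 import Dataset

df_list = []
for file in glob.glob('*.nc'):
    data = Dataset(file, 'r')
    # One column per entry in data.variables.keys().
    # Note: this only works if all arrays are 1-D and of equal length;
    # gridded (multi-dimensional) variables have to be flattened first.
    df_i = pd.DataFrame({
        'time': data.variables['time'][:],
        'lat': data.variables['lat'][:],
        'lon': data.variables['lon'][:],
        'tmp': data.variables['tmp'][:],
        'stn': data.variables['stn'][:],
        'year':  str(file)[4:13],  # maybe not needed as 'time' should have this info already, and [4:13] needs exactly this format
        'file_name': file,  # to track back the file
        # ... and more
        })
    data.close()  # the arrays were already read into memory by [:]

    df_list.append(df_i)

df = pd.concat(df_list, ignore_index=True)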
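
If `tmp` turns out to be a gridded variable with three dimensions rather than a 1-D column, the arrays above will not have equal lengths and need flattening first. Here is a minimal sketch of that, assuming the dimension order is (time, lat, lon), which the post does not confirm. `grid_to_frame` is a hypothetical helper, and small synthetic arrays stand in for the netCDF variables:

```python
import numpy as np
import pandas as pd


def grid_to_frame(time, lat, lon, tmp):
    """Flatten a (time, lat, lon) grid into one row per cell and time step.

    Assumption: tmp.shape == (len(time), len(lat), len(lon)).
    """
    # Replace masked values (if any) with NaN; plain arrays pass through
    tmp = np.ma.filled(tmp, np.nan)
    expected = (len(time), len(lat), len(lon))
    if tmp.shape != expected:
        raise ValueError(f"tmp has shape {tmp.shape}, expected {expected}")
    # indexing='ij' keeps the axis order (time, lat, lon) in the output grids
    t_grid, lat_grid, lon_grid = np.meshgrid(time, lat, lon, indexing="ij")
    return pd.DataFrame({
        "time": t_grid.ravel(),
        "lat": lat_grid.ravel(),
        "lon": lon_grid.ravel(),
        "tmp": tmp.ravel(),
    })


# Synthetic stand-in data: 2 time steps, 3 latitudes, 4 longitudes
time = np.array([0, 1])
lat = np.array([10.0, 10.5, 11.0])
lon = np.array([20.0, 20.5, 21.0, 21.5])
tmp = np.arange(24, dtype=float).reshape(2, 3, 4)

flat = grid_to_frame(time, lat, lon, tmp)
```

The shape check raises early if the assumed dimension order does not match the file, which is easier to debug than an index error deep inside a loop.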

Step 2: Map the locations

For example, with groupby, but there are several other methods. Depending on the amount of data, I suggest using pandas or numpy routines rather than Python loops, as they are much faster.

df['city'] = None
gp = df.groupby(['lon', 'lat'])
for values_i, indexes_i in gp.groups.items():
    # Add your code to get the closest city
    # values_i[0] is 'lon'
    # values_i[1] is 'lat'
    
    # e.g.:
    diff_lon_lat = np.hypot(cities['lon']-values_i[0], cities['lat']-values_i[1])
    # idxmin() returns the index label, which is what .loc expects
    location = cities.loc[diff_lon_lat.idxmin(), 'code_nbs']
    
    # and add the parameters to the df
    df.loc[indexes_i, 'city'] = location
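
The same mapping can also be done without a Python loop by using numpy broadcasting. This is only a sketch on made-up data: the two DataFrames below are hypothetical stand-ins, with column names taken from the question's CSV. Like the loop above, it uses a plain Euclidean distance in degrees, not a great-circle distance:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the column names follow the question's CSV
cities = pd.DataFrame({
    "code_nbs": ["A", "B"],
    "lat": [10.0, 50.0],
    "lon": [20.0, 5.0],
})
df = pd.DataFrame({
    "lat": [10.5, 49.0, 11.0, 49.0],
    "lon": [20.5, 4.0, 21.0, 4.0],
    "tmp": [1.0, 2.0, 3.0, 4.0],
})

# Distance from every row to every city: shape (n_rows, n_cities)
d_lon = df["lon"].to_numpy()[:, None] - cities["lon"].to_numpy()[None, :]
d_lat = df["lat"].to_numpy()[:, None] - cities["lat"].to_numpy()[None, :]
dist = np.hypot(d_lon, d_lat)

# Position of the closest city for each row, then look up its code
nearest = dist.argmin(axis=1)
df["city"] = cities["code_nbs"].to_numpy()[nearest]
```

The distance matrix has one entry per row and city, so for a large grid it may be worth computing it on the unique (lon, lat) pairs only and merging the result back.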

Thank you, Andrew. It is exactly what I want to do. Actually, my code worked for another set of netCDF files. I don't understand why the dimensions of these files cause problems... — ele_al_12, Sep 30 '22 at 08:11
Perfect. If you like, you could accept the answer to mark the question as solved. As mentioned above in the comments, you calculate your indexes (here: `t_index`) by assuming that all your arrays have (at least) a specific size. But what if some arrays are shorter? Your error is similar to `np.arange(4)[4]`. — Andrew, Sep 30 '22 at 09:06
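
To make the size assumption explicit, the indexes can be compared against the variable's shape before indexing. A minimal sketch: `out_of_bounds_axes` is a hypothetical helper, and the shape below is an assumption (120 months on a 360 x 720 grid, ordered time, lat, lon). The index values 592, 259 and 0 are the ones printed in the question. For the real file, `tmp.shape` and `tmp.dimensions` on the netCDF4 variable show the actual sizes and order.

```python
def out_of_bounds_axes(shape, indices):
    """Return the axes on which an index falls outside the array's shape.

    Negative indices are treated as out of bounds here for simplicity.
    """
    return [axis for axis, (size, idx) in enumerate(zip(shape, indices))
            if not 0 <= idx < size]


# Hypothetical shape, ordered (time, lat, lon). The real order is an
# assumption; check tmp.dimensions and tmp.shape on the actual file.
shape = (120, 360, 720)

# Index order used in the question: (lon, lat, time)
bad_axes = out_of_bounds_axes(shape, (592, 259, 0))

# Same values reordered to match the assumed (time, lat, lon) layout
ok_axes = out_of_bounds_axes(shape, (0, 259, 592))
```

Under this assumed layout, the question's index order fails on the first axis, while the reordered indexes all fit. That would also explain why the same code worked on files with a different dimension order.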
