I want to extract monthly temperature data from several netCDF files in different locations. Files are built as follows:

> print(data.variables.keys())
dict_keys(['lon', 'lat', 'time', 'tmp','stn'])

The files are named like "tmp_1901_1910.nc".

Here is the code I use:

import glob
import os
import time

import numpy as np
import pandas as pd
from netCDF4 import Dataset

os.chdir('PATH/data_tmp')
all_years = []

for file in glob.glob('*.nc'):
    data = Dataset(file,'r')
    time_data = data.variables['time'][:]
    time = data.variables['time']
    year =  str(file)[4:13]

    all_years.append(year)
   
# Empty pandas dataframe
year_start = min(all_years)
end_year = max(all_years)

date_range = pd.date_range(start = str(year_start[0:4]) + '-01-01', end = str(end_year[5:9]) + '-12-31', freq ='M')

df = pd.DataFrame(0.0, columns = ['Temp'], index = date_range)


# Defining the location, lat, lon based on the csv data 
cities = pd.read_csv(r'PATH/cities_coordinates.csv', sep =',')


cities['city']= cities['city'].map(str)


for index, row in cities.iterrows():
    location = row['code_nbs']
    location_latitude = row['lat']
    location_longitude = row['lon']
     
    # Sorting the list
    all_years.sort()
    
    for yr in all_years:
        #Reading in the data
        data = Dataset('tmp_'+str(yr)+'.nc','r')
        
        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]
    
        # Squared difference between the specified lat, lon and the lat, lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2
        
        
        # Retrieving the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()
            
        # Accessing the temperature data
        tmp  = data.variables['tmp']
        
        start = str(yr[0:4])+'-01-01'
        end = str(yr[5:11])+'-12-31'
        d_range = pd.date_range(start = start, end = end, freq='M')
        
        for t_index in np.arange(0, len(d_range)):
             print('Recording the value for: '+str(d_range[t_index]))
             df.loc[d_range[t_index]]['Temp']=tmp[min_index_lon, min_index_lat, t_index]
           
    df.to_csv(location +'.csv')

I get the following error when running the line df.loc[d_range[t_index]]['Temp']=tmp[min_index_lon, min_index_lat, t_index]:

IndexError: index exceeds dimension bounds

Inspecting the values involved, I get:

print(d_range)
DatetimeIndex(['1901-01-31', '1901-02-28', '1901-03-31', '1901-04-30',
               '1901-05-31', '1901-06-30', '1901-07-31', '1901-08-31',
               '1901-09-30', '1901-10-31',
               ...
               '1910-03-31', '1910-04-30', '1910-05-31', '1910-06-30',
               '1910-07-31', '1910-08-31', '1910-09-30', '1910-10-31',
               '1910-11-30', '1910-12-31'],
              dtype='datetime64[ns]', length=120, freq='M')

On the first t_index within the loop, I have:

print(t_index)
0

print(d_range[t_index])
1901-01-31 00:00:00

print(min_index_lat)
259
print(min_index_lon)
592

I don't understand what went wrong with the dimensions.

Thank you for any help!

ele_al_12
  • It would help if you said what line was causing the error, and preferably made the problem reproducible. It seems impossible to provide an answer without more info – Robert Wilson Sep 27 '22 at 16:10
  • This error occurs when you provide an index which is outside the allowed range of values in the file. Can you inspect the values of `d_range`, `d_range[t_index]`, and the indexers `min_index_lon, min_index_lat, t_index` to ensure they all fall within the bounds of df and tmp? If you think they do, can you print these out and copy the result into your question (as a code block), and also include the full traceback? It's always important to post the full traceback when asking questions about errors. Thanks! – Michael Delgado Sep 27 '22 at 16:57
  • I recommend checking out xarray and regridding packages such as xesmf. Those packages offer very easy and efficient ways to handle situations like this – Robert Wilson Sep 27 '22 at 21:51
  • Thank you for your comment, Robert. I am a beginner in Python and am not sure I understand what you recommend. I do not use xarray; why should I check it out? – ele_al_12 Sep 28 '22 at 08:29
  • You do a lot of things while reading in the file(s). I suggest splitting it up: first import your 'raw' data into one or more DataFrames, and analyse it afterwards. I think your error occurs because the `d_range` you calculate and `data.variables['tmp']` have different indexes. – Andrew Sep 28 '22 at 08:48
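
A minimal sketch of the check suggested in the comments above: compare each index with the size of the axis it is used on. The shape below is hypothetical (it assumes the variable is stored as (time, lat, lon) on a 360 x 720 grid, which the question does not confirm), and `out_of_bounds_axes` is an illustrative helper, not part of any library. With netCDF4, the real layout can be read from `tmp.shape` and `tmp.dimensions`.

```python
def out_of_bounds_axes(shape, indices):
    """Return the axes on which an index falls outside the array shape."""
    return [
        axis
        for axis, (size, idx) in enumerate(zip(shape, indices))
        if not -size <= idx < size
    ]


# Hypothetical shape for a variable stored as (time, lat, lon):
# 120 months on a 360 x 720 grid (an assumption, not taken from the files).
assumed_shape = (120, 360, 720)

# Index order used in the question: (lon, lat, time)
bad_axes = out_of_bounds_axes(assumed_shape, (592, 259, 0))

# Index order matching the assumed (time, lat, lon) layout
ok_axes = out_of_bounds_axes(assumed_shape, (0, 259, 592))

print(bad_axes, ok_axes)
```

If the real variable does use this layout, the first call flags axis 0, because 592 is being used as a time index, while the reordered indices pass.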

1 Answer

I assume you want to read in all the .nc data and map the closest city to it. For that, I suggest reading all the data first and then calculating which city each location belongs to. The following code probably needs some adaptations to your data; it is meant to show a direction you could take to make the code more robust.

Step 1: Import your 'raw' data

e.g. into one or more DataFrames, depending on whether you can import all the data at once. If not, split steps 1 and 2 into chunks.

import glob
import pandas as pd
from netCDF4 import Dataset

df_list = []
for file in glob.glob('*.nc'):
    data = Dataset(file, 'r')
    # See data.variables.keys() for the available variables.
    # All columns must be 1-D and of equal length, so gridded
    # variables may need to be flattened first.
    df_i = pd.DataFrame({
        'time': data.variables['time'][:],
        'lat': data.variables['lat'][:],
        'lon': data.variables['lon'][:],
        'tmp': data.variables['tmp'][:],
        'stn': data.variables['stn'][:],
        'year':  str(file)[4:13],  # maybe not needed as 'time' should have this info already, and [4:13] needs exactly this format
        'file_name': file,  # to track back the file
        # ... and more
        })

    df_list.append(df_i)

df = pd.concat(df_list, ignore_index=True)
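
One adaptation that may be needed: if 'tmp' is a 3-D grid, it has to be flattened so that every row holds one (time, lat, lon) combination. Below is a minimal sketch with small synthetic arrays standing in for `data.variables[...][:]`. The (time, lat, lon) axis order and all values are assumptions; check `data.variables['tmp'].dimensions` for the real order.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the netCDF variables (hypothetical values)
time = np.array([0, 1, 2])                      # 3 time steps
lat = np.array([10.0, 20.0])                    # 2 latitudes
lon = np.array([100.0, 110.0, 120.0, 130.0])    # 4 longitudes
# Assumed axis order: (time, lat, lon)
tmp = np.arange(3 * 2 * 4, dtype=float).reshape(3, 2, 4)

# from_product varies the last level fastest, which matches the
# C-order layout returned by tmp.ravel()
index = pd.MultiIndex.from_product(
    [time, lat, lon], names=['time', 'lat', 'lon']
)
df_long = pd.DataFrame({'tmp': tmp.ravel()}, index=index).reset_index()

print(df_long.head())
```

The resulting long table has one column per coordinate, so it can be grouped by ['lon', 'lat'] afterwards.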

Step 2: Map the locations

e.g. with groupby, though there are several other methods. Depending on the amount of data, I suggest using pandas or numpy routines rather than Python loops; they are much faster.

df['city'] = None
gp = df.groupby(['lon', 'lat'])
for values_i, indexes_i in gp.groups.items():
    # Add your code to get the closest city
    # values_i[0] is 'lon'
    # values_i[1] is 'lat'
    
    # e.g.:
    diff_lon_lat = np.hypot(cities['lon']-values_i[0], cities['lat']-values_i[1])
    location = cities.loc[diff_lon_lat.argmin(), 'code_nbs']
    
    # and add the parameters to the df
    df.loc[indexes_i, 'city'] = location
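
As a loop-free variant of the same idea, here is a minimal sketch that finds the closest city for every grid point at once with numpy broadcasting. The two small tables are made up for illustration and only reuse the column names from the question ('code_nbs', 'lat', 'lon'). As with the np.hypot call above, the distance is a plain difference in degrees, not a true distance on the sphere.

```python
import numpy as np
import pandas as pd

# Hypothetical cities table with the column names used in the question
cities = pd.DataFrame({
    'code_nbs': ['A', 'B'],
    'lat': [10.0, 50.0],
    'lon': [0.0, 20.0],
})

# Hypothetical grid points, one row per unique (lon, lat) pair
points = pd.DataFrame({
    'lon': [1.0, 19.0, 5.0],
    'lat': [11.0, 49.0, 20.0],
})

# Broadcasting builds a distance matrix of shape (n_points, n_cities)
d_lon = points['lon'].to_numpy()[:, None] - cities['lon'].to_numpy()[None, :]
d_lat = points['lat'].to_numpy()[:, None] - cities['lat'].to_numpy()[None, :]
dist = np.hypot(d_lon, d_lat)

# Column position of the closest city for each point, then its code
nearest = dist.argmin(axis=1)
points['city'] = cities['code_nbs'].to_numpy()[nearest]

print(points)
```

The distance matrix grows with n_points times n_cities, so for a very large grid it may be necessary to process the points in chunks.
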
Andrew
  • Thank you, Andrew. That is exactly what I want to do. Actually, my code worked for another set of nc files. I don't understand why the dimensions of these nc files cause problems... – ele_al_12 Sep 30 '22 at 08:11
  • Perfect. If you like, you could accept the answer to mark it as solved. As mentioned in the comments above, you calculate your indexes (here: `t_index`) by assuming that all your arrays have (at least) a specific size. But what if some arrays are shorter? Your error is similar to `np.arange(4)[4]`. – Andrew Sep 30 '22 at 09:06
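
A minimal sketch of the point made in the last comment, using made-up data: the date range expects 12 months but the array holds only 10 values, so indexing by the length of the date range fails. One possible guard is to loop over only the time steps present in both. The names `values`, `overflow` and `series` are illustrative. The 'MS' (month start) frequency is used here only because it works across pandas versions; the question uses month-end dates.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 12 expected months, but only 10 stored values
d_range = pd.date_range(start='1901-01-01', end='1901-12-31', freq='MS')
values = np.arange(10, dtype=float)

# Indexing past the end raises IndexError, just like np.arange(4)[4]
try:
    values[len(d_range) - 1]
    overflow = False
except IndexError:
    overflow = True

# Guard: only use the time steps that exist in both objects
n = min(len(d_range), len(values))
series = pd.Series(values[:n], index=d_range[:n], name='Temp')

print(overflow, n)
```

Comparing `len(d_range)` with the size of the variable's time axis before the loop shows directly whether the two disagree.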