
I have two datasets (called satdata and atmosdata). Atmosdata is evenly gridded in latitude and longitude. It has the dimensions (latitude: 713, level: 37, longitude: 1440, time: 72) and a total size of 12 GB. Atmosdata has several variables such as temperature and humidity; humidity has the shape (time, level, latitude, longitude).

Satdata contains satellite observations and has the dimensions (across_track: 90, channel: 3, time: 32195), i.e. 90*3*32195 = 8,692,650 data points. Across_track is the satellite FOV position across the track. Satdata is not evenly gridded in latitude/longitude; for example, satdata.latitude has the dimensions (time, channel, across_track), and the same holds for satdata.longitude and satdata.sft.

The 'time' variable in atmosdata and satdata covers the same day, but with different values in the two datasets. I need to find the atmosdata (humidity and temperature, for example) that has the same latitude, longitude and time as satdata.

To do this, I iterate over satdata to find the location and time of each observation; then I find the corresponding atmosdata (first the grid point nearest to the satellite location, then interpolated to the satellite time). Finally I concatenate the resulting atmosdata from all the iterations into one dataset.

A part of my code is as follows, using a small piece of data.

import xarray as xr
import numpy as np
import dask
import pandas as pd

# define a simple satdata dataset
lon = np.array([[18.717, 18.195, 17.68],
                [18.396, 17.87, 17.351]])
lat = np.array([[-71.472, -71.635, -71.802],
                [-71.52, -71.682, -71.847]])
sft = np.array([[1, 1, 1],
                [1, 1, 1]])
time = np.array(['2010-09-07T00:00:01.000000000', '2010-09-07T00:00:03.000000000'], dtype='datetime64[ns]')
satdata = xr.Dataset({'sft': (['time', 'across_track'], sft)},
                     coords={'longitude': (['time', 'across_track'], lon),
                             'latitude': (['time', 'across_track'], lat),
                             'time': time})

# atmosdata
atmoslat = np.array([-71.75, -71.5, -71.25, -71., -70.75, -70.5, -70.25, -70., -69.75])
atmoslon = np.array([17.25, 17.5, 17.75, 18., 18.25, 18.5, 18.75, 19., 19.25])
atmostime = np.array(['2010-09-07T00:00:00.000000000', '2010-09-07T01:00:00.000000000'], dtype='datetime64[ns]')

atmosq = np.random.rand(2, 9, 9)
atmosdata = xr.Dataset({'q': (['time', 'latitude', 'longitude'], atmosq)},
                       coords={'longitude': (['longitude'], atmoslon),
                               'latitude': (['latitude'], atmoslat),
                               'time': (['time'], atmostime)})

# do the matching:
matched = dask.compute(match(atmosdata, satdata), scheduler='processes', num_workers=20)[0]

The matching function is as follows:

@dask.delayed
def match(atmosdata, satdata):
    newatmos = []
    newind = 0
    # iterate over every satellite observation
    for idx, value in np.ndenumerate(satdata.sft.values):
        if not np.isnan(value):
            # latitude and longitude of this satellite observation
            temp_lat = satdata.latitude.isel(time=[idx[0]], across_track=[idx[1]])
            temp_lon = satdata.longitude.isel(time=[idx[0]], across_track=[idx[1]])
            # atmosdata at the grid point nearest to this location
            temp_loc = atmosdata.sel(latitude=temp_lat.values.ravel()[0],
                                     longitude=temp_lon.values.ravel()[0],
                                     method='nearest')
            if (temp_loc.q > 0).all():
                # interpolate atmosdata to the satellite observation time
                temp_time = satdata.time.isel(time=[idx[0]])
                newatmos.append(temp_loc.interp(time=temp_time.data.ravel()))
                newind += 1

    return xr.concat(newatmos, dim=pd.Index(range(newind), name='NewInd'))

1) When I launch the code with the small dataset above, it works. But when I use my original data (with the dimensions mentioned above) instead, the computation fails with the following error:

---> 52 matched = dask.compute(match(ecmwfdata, ssmis_land), scheduler='processes', num_workers=20 )
error: 'i' format requires -2147483648 <= number <= 2147483647

2) If I use datasets with other dimensions, satdata (across_track: 90, channel: 3, time: 100) and atmosdata (latitude: 71, level: 37, longitude: 1440, time: 72), the computation takes a very long time. I guess my code does not make optimal use of dask to solve this problem quickly.

3) Is there a better way than using the for loop? The for loop may not take advantage of dask for fast computation.

4) Would it be a good idea to chunk satdata, then find the latitude and longitude limits of each satdata chunk, chunk atmosdata according to those limits, and finally apply the match function to each pair of satdata/atmosdata chunks? If this is a good idea, I do not know yet how to iterate over each chunk of satdata manually...

5) The function takes two arguments, satdata and atmosdata. Since these two datasets can be quite big (12 GB for atmosdata), would that make the computation slower?

6) In the function I had to use .values in the selection; will this slow down the computation when using large input data?

Thanks in advance !

Best regards

Xiaoni

  • Hi Xiaoni, welcome to SO. Do you mind sharing a [mcve](https://stackoverflow.com/help/mcve) [mcve2](http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports)? I guess that there is a better way than a for loop to filter the data. – rpanai Sep 24 '18 at 12:09
  • Hi, I tried to shorten the post and hope it is better. – Xiaoni Sep 24 '18 at 13:37
  • It will be better if you can provide a sample of your dataset. Even a dummy version of it to work with will be great. – rpanai Sep 24 '18 at 13:41
  • Yes, I created a very small dataset in order to test the code. Thanks in advance ! – Xiaoni Sep 24 '18 at 15:13
  • It would be helpful if you also define atmosdata as a simple dummy dataset, then people can test your function. – dreab Sep 28 '18 at 06:37
  • Hi Dreab, I added a small dataset for the atmosdata. Thanks in advance ! – Xiaoni Sep 28 '18 at 09:55

1 Answer


I'm not really sure that I can help, other than to say that I think you need to use a resampling package that uses distance trees:

https://github.com/pytroll/pyresample

Pyresample has swath and grid definitions that should do what you want.
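
A rough sketch of how that could look, assuming you resample one atmosdata variable (q) from its regular latitude/longitude grid onto the satellite swath with a nearest-neighbour lookup. The variable names reuse the small example datasets above, and radius_of_influence is just an assumed value you would tune to your grid spacing:

import numpy as np
from pyresample import geometry, kd_tree

# build a GridDefinition from the regular atmosdata grid (needs 2-D lon/lat arrays)
grid_lons, grid_lats = np.meshgrid(atmosdata.longitude.values, atmosdata.latitude.values)
grid_def = geometry.GridDefinition(lons=grid_lons, lats=grid_lats)

# build a SwathDefinition from the irregular satellite lon/lat
swath_def = geometry.SwathDefinition(lons=satdata.longitude.values,
                                     lats=satdata.latitude.values)

# nearest-neighbour resampling of one atmosdata time step onto the swath;
# radius_of_influence is in metres (an assumption, tune it to your ~0.25 deg grid)
q_on_swath = kd_tree.resample_nearest(grid_def,
                                      atmosdata.q.isel(time=0).values,
                                      swath_def,
                                      radius_of_influence=50000,
                                      fill_value=np.nan)

# q_on_swath has the same shape as satdata.latitude, i.e. one value per observation;
# the interpolation between atmosdata time steps still has to be done separately.

The nearest-neighbour search is then done once for all observations inside the k-d tree, instead of in a Python loop over every satellite point.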

I have used https://github.com/storpipfugl/pykdtree to find nearest points.
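
For completeness, a minimal sketch of the pykdtree route, assuming you build the tree once on the flattened atmosdata grid coordinates and query it with all satellite positions at once. Plain lon/lat degrees are used only to keep the example short; for real geographic distances you would normally convert to 3-D Cartesian coordinates first:

import numpy as np
from pykdtree.kdtree import KDTree

# flatten the regular atmosdata grid into (n_points, 2) lon/lat pairs
grid_lons, grid_lats = np.meshgrid(atmosdata.longitude.values, atmosdata.latitude.values)
grid_points = np.column_stack([grid_lons.ravel(), grid_lats.ravel()])

# build the k-d tree once on the grid
tree = KDTree(grid_points)

# query it with all satellite observation positions at once (no Python loop)
sat_points = np.column_stack([satdata.longitude.values.ravel(),
                              satdata.latitude.values.ravel()])
dist, idx = tree.query(sat_points, k=1)

# idx gives, for every observation, the nearest point in the flattened grid;
# unravel it to recover the (latitude, longitude) grid indices
lat_idx, lon_idx = np.unravel_index(idx.ravel(), grid_lons.shape)

With lat_idx and lon_idx you could then pull the matching atmosdata columns in one vectorized selection and interpolate them to the satellite times, instead of iterating over every observation.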