0

I have a netCDF that looks like the following ad is available here:

import xarray as xr
ds=xr.open_dataset('merged_new.nc')
ds

enter image description here

The dataset has a spatial resolution of 300m. The variable Forest is a binary variable 0 or 1.

For each pixel (point) I would like to count the sum of Forest in a radius of 30km (i.e. ~300 closest pixels). I would like to do a sort of moving windows that allow me to assign to each pixel the sum of Forest in a radius of 30km.

Ideally I would like to add a variable to the dataset called ForestSum that is the sum of Forest in the surrounding.

Is there a way to do this?

emax
  • 6,965
  • 19
  • 74
  • 141
  • Do you know how to convert Forest, Lat, Lon to Pandas dataframe? (or numpy array) – Corralien Mar 31 '22 at 08:56
  • yes sure. Converting to a Pandas dataframe would be to much time consuming however. – emax Mar 31 '22 at 08:58
  • I know how to find all points inside a radius but I need to extract each coordinates of Forest :) And I don't know how to do that. – Corralien Mar 31 '22 at 09:02
  • @Corralien we can extract the values in this way, `forest=ds['Forest'].values`, `latitutde=ds['lat'].values`, `longitude=ds['lon'].values` – emax Mar 31 '22 at 09:22
  • You have too many records in your dataset!!!. I will post my method but it's unusable :( – Corralien Mar 31 '22 at 09:29

1 Answers1

0

Use BallTree from sklearn:

import xarray as xr
import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree, DistanceMetric

ds = xr.open_dataset('merged_new.nc')

forest = ds['Forest'].values
latitude = ds['lat'].values
longitude = ds['lon'].values

df = pd.DataFrame(forest, latitude, longitude).rename_axis(index='lat', columns='lon') \
                                              .stack().rename('forest').reset_index()

# Prepare data
coords = np.radians(df[['lat', 'lon']])
dist = DistanceMetric.get_metric('haversine')
tree = BallTree(coords, metric=dist)

Now you can query data. Suppose for the first point

# lat: -7.1125, lon: -73.9875
>>> df.iloc[tree.query_radius([coords.iloc[0]], r=0.003)[0]]['forest'].sum()
3080

In theory, you can do but the process is killed because there are too many records:

indices = tree.query_radius(coords, r=0.003)
df['sum'] = [df.iloc[i]['forest'].sum() for i in indices]

You can use a multiprocessing version and slice data.

Corralien
  • 109,409
  • 8
  • 28
  • 52