
I have seven 2D cloud-optimised GeoTIFFs stacked into one DataArray in xarray. They are very large, so I am using the intake-xarray plugin with dask to stream the data lazily from S3 without loading it all into RAM. I have concatenated them along their "band" dimension to stack them.

import intake
import xarray as xr

catalog = intake.open_catalog("s3://example-data/datasets.yml")
datasets = ['dem',
            'dem_slope',
            'dem_slope_aspect',
            'distance_railways',
            'distance_river',
            'distance_road',
            'worldcover']
to_concat = []
for data in datasets:
    x = catalog[data].to_dask()
    to_concat.append(x)


merged = xr.concat(to_concat, dim='band')
merged.coords['band'] = datasets  # add labels to band dimension

y_array = catalog["NASA_global"].to_dask()
y_array.coords['band'] = 'NASA_global'

merged 
<xarray.DataArray (band: 7, y: 225000, x: 450000)>
dask.array<concatenate, shape=(7, 225000, 450000), dtype=float32, chunksize=(1, 256, 256), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) <U31 'dem' ... 'worldcover'
  * y        (y) float64 90.0 90.0 90.0 90.0 90.0 ... -90.0 -90.0 -90.0 -90.0
  * x        (x) float64 -180.0 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0 180.0
Attributes:
    transform:      (0.0008, 0.0, -180.0, 0.0, -0.0008, 90.0)
    crs:            +init=epsg:4326
    res:            (0.0008, 0.0008)
    is_tiled:       1
    nodatavals:     (32767.0,)
    scales:         (1.0,)
    offsets:        (0.0,)
    AREA_OR_POINT:  Area

My question is: how can I now convert the data into several 1D columns, equivalent to flattening a 2D array in numpy? I have looked at .squeeze() to remove dimensions but can't get the data into the desired shape. I want to do some machine learning, so I need the data as a table with one row per pixel and one column per band. I'm new to dask and xarray.

I'd really appreciate any help or advice.

  • This may be a case for [`xr.Dataset.to_dask_dataframe`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.to_dask_dataframe.html) – Michael Delgado Jul 27 '22 at 05:44
  • Yes absolutely. I wanted to see if there was a way to flatten the data in xarray to keep track of the coordinate labels ('x' and 'y') along the way. This would allow me to reshape back to the original array at any time with no information loss. Masking nan values in a DataFrame would mean I can't easily reshape, but I got around it by adding columns to the dask DataFrame that index the array row and col (see the sketch below). – wmitchell93 Jul 27 '22 at 10:52
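
For reference, a minimal sketch of the to_dask_dataframe route mentioned in the comments (it assumes the lazily loaded merged DataArray from the question; the dropna call on 'dem' is purely illustrative):

# split the band dimension into one data variable per band
ds = merged.to_dataset(dim='band')
# lazy dask DataFrame with x, y and one column per band
ddf = ds.to_dask_dataframe()
# masking rows keeps the x and y columns intact, so the grid can be rebuilt later
ddf = ddf.dropna(subset=['dem'])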

1 Answer


For anyone interested, I worked out how to do it in Xarray but it blew up the memory of my instance.

import intake
import xarray as xr

# load the intake catalog
catalog = intake.open_catalog("s3://example-data/datasets.yml")
datasets = ['dem',
            'dem_slope',
            'dem_slope_aspect',
            'distance_railways',
            'distance_river',
            'distance_road',
            'worldcover']
to_concat = []
for data in datasets:
    x = catalog[data].to_dask()
    to_concat.append(x)

# define X and y
merged = xr.concat(to_concat, dim='band').sel(x=slice(-124, -66), y=slice(50, 24))
merged.coords['band'] = datasets
y_array = catalog["NASA_global"].to_dask()
y_array.coords['band'] = 'NASA_global'

# reshape: collapse x and y into a single sample dimension, one column per band
X_temp = merged.stack(z=('x', 'y'))
X = X_temp.transpose('z', 'band')

# materialise as numpy -- this is the step that loads everything into memory
X_array = X.values

Calling .values loads everything into a numpy array and kills the instance. A colleague worked out a better solution that keeps everything lazy in dask and doesn't blow up memory:

import intake
import xarray as xr
import dask.dataframe as dd

catalog = intake.open_catalog("s3://example-data/datasets.yml")
datasets = ['dem',
            'dem_slope',
            'dem_slope_aspect',
            'distance_railways',
            'distance_river',
            'distance_road',
            'worldcover']
to_concat = []
for data in datasets:
    x = catalog[data].to_dask()
    to_concat.append(x)

# define X and y
X_array = xr.concat(to_concat, dim='band').sel(x=slice(-124, -66), y=slice(50, 24))
X_array.coords['band'] = datasets
y_array = catalog["NASA_global"].to_dask()

# reshape lazily: flatten y and x into one long sample axis, keeping band first
X_table = X_array.data.reshape((7, -1), merge_chunks=True)
y_table = y_array.data.reshape((1, -1), merge_chunks=True)

# transpose to (samples, features) and wrap as dask DataFrames
X = dd.from_dask_array(X_table.T, columns=datasets)
y = dd.from_dask_array(y_table.T, columns=['NASA_global'])
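
Both X and y stay lazy dask DataFrames, so nothing is pulled from S3 until a compute is triggered. A quick illustrative check of the resulting layout (head() only computes the first partition; nothing here is specific to the original datasets):

print(X.head())                        # a few rows: one column per band
print(y.head())                        # matching target column
print(X.npartitions, list(X.columns))  # partition count and feature names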