I have seven 2D cloud-optimised GeoTIFFs stacked into one DataArray in xarray. They are very large, so I am using the intake-xarray plugin and dask to stream the data from S3 lazily rather than loading it all into RAM. I concatenated them along their "band" dimension to stack them:
import intake
import xarray as xr

catalog = intake.open_catalog("s3://example-data/datasets.yml")

datasets = ['dem',
            'dem_slope',
            'dem_slope_aspect',
            'distance_railways',
            'distance_river',
            'distance_road',
            'worldcover']

to_concat = []
for name in datasets:
    da = catalog[name].to_dask()  # lazy dask-backed DataArray, nothing loaded yet
    to_concat.append(da)

merged = xr.concat(to_concat, dim='band')
merged.coords['band'] = datasets  # label the band dimension with the dataset names
y_array = catalog["NASA_global"].to_dask()  # target variable, also lazy
y_array.coords['band'] = 'NASA_global'      # scalar band coordinate for the target
merged
<xarray.DataArray (band: 7, y: 225000, x: 450000)>
dask.array<concatenate, shape=(7, 225000, 450000), dtype=float32, chunksize=(1, 256, 256), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) <U31 'dem' ... 'worldcover'
  * y        (y) float64 90.0 90.0 90.0 90.0 90.0 ... -90.0 -90.0 -90.0 -90.0
  * x        (x) float64 -180.0 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0 180.0
Attributes:
    transform:      (0.0008, 0.0, -180.0, 0.0, -0.0008, 90.0)
    crs:            +init=epsg:4326
    res:            (0.0008, 0.0008)
    is_tiled:       1
    nodatavals:     (32767.0,)
    scales:         (1.0,)
    offsets:        (0.0,)
    AREA_OR_POINT:  Area
My question is: how can I now convert this into several 1D columns, one per band, equivalent to flattening a 2D array in numpy? I have looked at .squeeze() to remove dimensions, but I can't get the data into the desired format. I want to do some machine learning, so I need a table-like layout of samples by features. I'm new to dask and xarray.
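To make the target format concrete, here is a small numpy sketch of the shape I am after (the array below is a made-up stand-in, not my real data):

import numpy as np

# toy stand-in for a single band: a small 2D array
band = np.arange(12, dtype="float32").reshape(3, 4)

flat = band.ravel()  # shape (12,) -- one 1D column for this band
print(flat.shape)    # (12,)

# for the full (band, y, x) stack I'd want shape (y * x, band):
# one flattened column like this per band, side by side

So effectively I want each pixel to become a row, with the seven bands as feature columns and NASA_global as the target column, while keeping everything as lazy dask arrays.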
I'd really appreciate any help or advice.