
I am desperately trying to split strings within an xarray.DataArray. Every element of the array should be split into fixed-width pieces, e.g.

"aaabbbccc" --> [aaa, bbb, ccc]

Fortunately, such a function already exists in the textwrap library, but applying it to my DataArray is a different story:

xds = riox.open_rasterio(fp_output_tmp_mlsieved, chunks = "auto")

<xarray.DataArray (band: 1, y: 2, x: 2)>
dask.array<transpose, shape=(1, 2, 2), dtype=<U18, chunksize=(1, 2, 2), chunktype=numpy.ndarray>
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 3.077e+06 3.077e+06 ... 3.077e+06 3.077e+06
  * y            (y) float64 1.865e+06 1.865e+06 ... 1.865e+06 1.865e+06
    spatial_ref  int64 0

Loaded it looks like this:

array([[['000000000000000000', '000000000000000000'],
        ['000000000000000000', '000000000000000000']]], dtype='<U18')

I think a solution is to apply it with xr.apply_ufunc(). I have managed to do that with a simpler numpy function before, but with wrap() all I get is a bunch of errors. I think the main issue is that it is not a vectorized numpy function, and second, that I can't get the dimensions to work out. My latest attempt looks like this:

import numpy as np
import xarray as xr
from textwrap import wrap

def decompressor(s, l):
    return np.array(wrap(s.item(), l))



def ufunc_decompressor(s, l):
    return xr.apply_ufunc(
        decompressor,
        s, l,
        output_dtypes=[np.dtype(f"U{l}")],
        input_core_dims=[["band"],[]],
        output_core_dims=[["band"]],
        exclude_dims=set(("band",)),
        dask="parallelized",
        vectorize=True
        )

xds_split = ufunc_decompressor(xds, 3).load()

What I get is a cryptic error:

  File "/home/.../miniconda3/envs/postproc/lib/python3.10/site-packages/dask/array/gufunc.py", line 489, in <genexpr>
    core_output_shape = tuple(core_shapes[d] for d in ocd)
KeyError: 'dim0'
  • Can you step back and explain what you’re actually trying to do? Why do you have numbers represented by strings? How do you want them reshaped? Can you use solutions other than text wrap, which is not at all designed to be used with numpy or dask arrays? – Michael Delgado Dec 17 '22 at 02:37
  • The data comes from a geotiff containing a single band of int64 values. Each value actually represents 6 x 3 zero-padded and joined codes. As an example (but with only 2 x 3 digits): the original geotiff integer is 1020; converted to a string and zero-padded to 6 digits it becomes "001020". – martin-git Dec 17 '22 at 10:39
  • Additionally: The goal is to write each code into a single geotiff. So my plan is to split up the string, turn to integer again and write each "band" into a geotiff: "001020" --> ["001", "020"] -->[1, 20] – martin-git Dec 17 '22 at 10:48
  • If the data is currently int64 it would be significantly faster and more memory efficient to do this with math (eg floor and modulo division) than to convert it to a string and use text wrap. Something along the lines of `xr.concat([da // 1000000, da // 1000 % 1000, da % 1000], dim="new dim")` should work. Even if the data comes as a string this is how I’d do it - just convert to int first. – Michael Delgado Dec 17 '22 at 10:48
  • If you really want to use string manipulations, I’d use `xr.DataArray.str.slice`, eg `band1 = da.str.slice(0, 3); band2 = da.str.slice(3, 6);` etc. – Michael Delgado Dec 17 '22 at 10:51
  • Thanks! The "math" way is way more elegant. I think I can drop the first term in the example though: `[1020 // 1000000, 1020 // 1000 % 1000, 1020 % 1000] = [0, 1, 20]`. Is there a reason you use dim="new dim"? How would you expand the original `DataArray (band: 1, y: 2, x: 2)` along the band dim, e.g. to `DataArray (band: 6, y: 2, x: 2)`? – martin-git Dec 17 '22 at 11:34
  • Yeah you can adapt my code - it’s just an example. I’m sure you can figure out how to change "new dim" to whatever you want, and extend my three tiered example to six. If you [edit] your question to provide a full setup and more fully describe your goals, ideally as an [mre], I could provide a more complete answer. I’m still not totally clear on what you’re trying to do. – Michael Delgado Dec 17 '22 at 11:51
  • If band is length 1 and you want it to be length 6, just use `da.squeeze("band", drop=True)` to remove it prior to adding it back in with concat – Michael Delgado Dec 17 '22 at 11:53
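
For reference, the arithmetic approach suggested in the comments above could be sketched roughly as follows. This is only an illustration, assuming the raster holds non-negative int64 values that pack six 3-digit codes each. The function name `split_codes`, its parameters `n_codes` and `digits`, and the sample values (other than 1020) are hypothetical and not from the original post.

```python
import numpy as np
import xarray as xr


def split_codes(da, n_codes=6, digits=3, dim="band"):
    """Split packed integers into n_codes fields of `digits` decimal digits.

    Hypothetical helper: assumes non-negative integer values.
    """
    base = 10 ** digits
    # Remove the length-1 band dimension so it can be rebuilt with n_codes entries.
    if dim in da.dims:
        da = da.squeeze(dim, drop=True)
    # Most significant code first: shift right by i fields, then keep one field.
    parts = [(da // base ** i) % base for i in reversed(range(n_codes))]
    out = xr.concat(parts, dim=dim)
    # Label the new bands 1..n_codes.
    return out.assign_coords({dim: np.arange(1, n_codes + 1)})


# Made-up sample data with the same layout as in the question (band: 1, y: 2, x: 2).
packed = xr.DataArray(
    np.array([[[1020, 0], [1002003004005006, 999]]], dtype=np.int64),
    dims=("band", "y", "x"),
)

result = split_codes(packed)
print(result.dims, result.shape)
print(result[:, 0, 0].values)  # codes packed into 1020
```

Because this uses only element-wise integer arithmetic and `xr.concat`, no string conversion or `apply_ufunc` call is involved. Each resulting band could then be selected with `result.sel(band=k)` before writing it out.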
