
I have a structured numpy array, containing sampled data from several measurement series: Each series samples m as a function of l, and differs from the other series by a. l is not sampled at constant values, and there is a different number of samples per series, so we can't just generate a 2D array for each of m and l. Example data:

    In [1]: data
    Out[1]:array([(   0., 1323.,  69384.), (   0., 1344.,  73674.), (   0., 1344.,  73674.),
           (   0., 1439.,  76678.), (   0., 1538.,  79584.), (   0., 1643.,  82389.),
           (   0., 2382.,  95634.), (   0., 2439.,  96028.), (   0., 2439.,  96028.),
           (   0., 2574.,  98154.), (   0., 2795.,  99937.), (1219., 1316.,  59055.),
           (1219., 1332.,  61473.), (1219., 1350.,  63881.), (1219., 1372.,  66270.),
           (1219., 1372.,  66270.), (1219., 1491.,  69654.), (1219., 1617.,  72917.),
           (1219., 1749.,  76053.), (1219., 1885.,  79060.), (1219., 2028.,  81927.),
           (1219., 2072.,  82803.), (1219., 2118.,  83606.), (1219., 2166.,  84340.),
           (1219., 2846.,  91028.), (1219., 2911.,  91379.), (1219., 2977.,  91635.),
           (1219., 4164.,  95161.), (2438., 1313.,  52688.), (2438., 1331.,  54496.),
           (2438., 1350.,  56304.), (2438., 1368.,  58113.), (2438., 1480.,  60990.),
           (2438., 1598.,  63754.), (2438., 1720.,  66399.), (2438., 1846.,  68926.),
           (2438., 1978.,  71326.), (2438., 2757.,  79713.), (2438., 2819.,  80026.),
           (2438., 2882.,  80258.), (2438., 4155.,  84968.)],
          dtype=[('a', '<f8'), ('l', '<f8'), ('m', '<f8')])

I'm pretty sure that it should be possible to turn this into an xarray.DataArray, since that is made to deal with incomplete data.

What I would like to end up with is an xarray.Dataset with two coordinates, a and i, where i is a simple integer index enumerating the data points in each measurement series. That way, I can get the measurements at each value of a through the first index and, for example, the first measured sample of every series by selecting i=0.

So my preferred end result would look like this:

    In [100]: desired_array
    Out[100]: <xarray.Dataset>
              Dimensions:  (a: 3, i: 17)
              Coordinates:
                * a  (a) float64 0. 1219. 2438.
                * i  (i) int64 0 1 2 3 4 ... 14 15 16
              Data variables:
                  l        (a, i) float64 1323. 1344. 1344. 1439. ... 2882. 4155.
                  m        (a, i) float64 69384. 73674. 73674. ... 80258. 84968.

There would be some missing values in there as well, since there are not exactly 17 data values for each value of a, but my understanding is that xarray can deal with this.

Simply constructing an xarray.DataArray with data['m'] as the data and the other two fields as coordinates fails, because that requires a 2D array as input, and data['m'] has only one dimension. I suppose I could iterate through the data manually, find the points where 'a' changes, and generate 2D arrays for both 'm' and 'l' in which each column corresponds to one value of 'a'. I could then put them into an xarray.Dataset with two data variables, 'l' and 'm', and two coordinates: 'a' and another, unnamed one that enumerates the points of each measurement series. But then I'd first have to figure out the length of the longest measurement series, and the resulting intermediate np.array would contain a bunch of empty fields. The full dataset has several more variables that differ between series, and I imagine that implementing code to sort everything would get pretty tedious at that point.
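For reference, the manual approach could be sketched roughly as below, using NumPy only. This is a sketch under assumptions rather than a finished solution: it uses a shortened subset of the example data (the first few rows of each series), it assumes that the rows of each series are contiguous and that the series appear in ascending order of a, and it puts one series per row so that the layout matches the (a, i) dimensions of the desired result. The names a_values, series, i and padded are made up for this illustration.

```python
import numpy as np

# Shortened subset of the example data: the first 3, 2 and 4 rows of the three series.
data = np.array(
    [(0., 1323., 69384.), (0., 1344., 73674.), (0., 1344., 73674.),
     (1219., 1316., 59055.), (1219., 1332., 61473.),
     (2438., 1313., 52688.), (2438., 1331., 54496.),
     (2438., 1350., 56304.), (2438., 1368., 58113.)],
    dtype=[("a", "<f8"), ("l", "<f8"), ("m", "<f8")],
)

# Unique values of a, the series number of every row, and the samples per series.
a_values, series, counts = np.unique(
    data["a"], return_inverse=True, return_counts=True
)
series = series.ravel()

# Position of every row inside its own series.
# Assumption: rows are grouped by series and the series are sorted by a.
starts = np.cumsum(counts) - counts
i = np.arange(len(data)) - starts[series]

# One NaN-padded 2D array per data variable, shape (number of series, longest series).
padded = {}
for name in ("l", "m"):
    arr = np.full((len(a_values), counts.max()), np.nan)
    arr[series, i] = data[name]
    padded[name] = arr

print(padded["l"])
```

The padded arrays, together with a_values and np.arange(counts.max()), would presumably be what goes into an xarray.Dataset with dimensions ("a", "i"). Every additional variable that differs between series would need its own bookkeeping of this kind, which is the tedious part.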

The recommended way, according to the xarray documentation, is to convert to a pandas DataFrame first and then use pandas.DataFrame.to_xarray().

As I've just been made aware (thanks to jhamman), pandas is actually a dependency of xarray, so this should be convenient.

However:

    In [61]: tempdf = pandas.DataFrame(data, index=data['a'].astype(int), columns=['l', 'm'])

    In [62]: tempdf
    Out[62]: 
               l        m
    0     1323.0  69384.0
    0     1344.0  73674.0
    0     1344.0  73674.0
    0     1439.0  76678.0
    0     1538.0  79584.0
    0     1643.0  82389.0
    0     2382.0  95634.0
    0     2439.0  96028.0
    0     2439.0  96028.0
    0     2574.0  98154.0
    0     2795.0  99937.0
    1219  1316.0  59055.0
    1219  1332.0  61473.0
    1219  1350.0  63881.0
    ...

It seems that Pandas does not notice that my chosen index is repeating, and does not group the data accordingly. Also, I'd like to add a second index which goes through the data where a is constant.
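To make concrete what I mean by a second index that counts through the data while a is constant, here is a minimal sketch on a shortened subset of the example data. It relies on groupby(...).cumcount() to restart the counter for each value of a, and on set_index to combine a and that counter into a two-level index. The column name i and the helper to_dataset are names chosen for this illustration, and I have not checked this against the full dataset with its additional variables.

```python
import numpy as np
import pandas as pd

# Shortened subset of the example data: the first 3, 2 and 4 rows of the three series.
data = np.array(
    [(0., 1323., 69384.), (0., 1344., 73674.), (0., 1344., 73674.),
     (1219., 1316., 59055.), (1219., 1332., 61473.),
     (2438., 1313., 52688.), (2438., 1331., 54496.),
     (2438., 1350., 56304.), (2438., 1368., 58113.)],
    dtype=[("a", "<f8"), ("l", "<f8"), ("m", "<f8")],
)

# A structured array becomes a DataFrame with one column per field: a, l, m.
df = pd.DataFrame(data)

# Running counter that restarts at 0 for every distinct value of a.
df["i"] = df.groupby("a").cumcount()

# Two-level index (a, i): every row now has a unique label.
indexed = df.set_index(["a", "i"])
print(indexed)


def to_dataset(frame):
    """Convert the (a, i)-indexed frame to an xarray.Dataset (needs xarray installed)."""
    return frame.to_xarray()
```

With such a two-level index, to_dataset(indexed) should give a Dataset with dimensions (a, i), where the positions that a shorter series never reaches are filled with NaN.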

Knowing that I probably won't like the result, I convert the above to xarray, and get:

    In [66]: tempdf.to_xarray()
    Out[66]: 
    <xarray.Dataset>
    Dimensions:  (index: 41)
    Coordinates:
      * index    (index) int64 0 0 0 0 0 0 0 ... 2438 2438 2438 2438 2438 2438 2438
    Data variables:
        l        (index) float64 1.323e+03 1.344e+03 ... 2.882e+03 4.155e+03
        m        (index) float64 6.938e+04 7.367e+04 ... 8.026e+04 8.497e+04

...not what I wanted:

  • the index has lost its name
  • the index variable has repeating values
  • ...and of course the data is still in 1D format.

What am I not getting? I tried different variations on the data above. Sometimes pandas seemed to accept an index variable the way I wanted and sometimes it did not, but I haven't worked out what the problem is. I definitely haven't worked out how to add a generic index, especially since the measurement series are not of equal length (so there's no nice, regular array that could hold the data).

Zak
  • Does it help? https://stackoverflow.com/questions/70319614/how-to-create-a-numpy-array-to-an-xarray-data-array – Corralien Jan 18 '23 at 11:10
  • Pandas is a core dependency of Xarray (for now). If you want to use Xarray, you'll need to also install Pandas. https://docs.xarray.dev/en/stable/getting-started-guide/installing.html – jhamman Jan 18 '23 at 16:23
  • oh dear, how did I miss that? I was sure I didn't have it, but it turns out that I do, indeed! – Zak Jan 18 '23 at 17:31
  • @Corralien: No, the problem in my case is that the input data is effectively a flattened version of the (irregular) array I'd like to end up with, and the coords are simply repeated for each sample. The problem in the question you linked to is that the OP didn't pass the coords in the correct way, but their data was already in a correctly-dimensioned array. – Zak Jan 18 '23 at 17:35
  • So can you show us a small example (small dimensions) of what you expect please? – Corralien Jan 18 '23 at 17:37
  • @Corralien: Just added an example of what I'd like to end up with. The full dataset has two more "indexing" variables, which I'd want to use as coords as well. That should turn the data variables into 4-dimensional arrays (3 variables constant for each series of measurements, 1 index within the series) – Zak Jan 23 '23 at 11:54
