
I have a large time series of np.float64 values at 5-minute frequency (~2,500,000 samples, roughly 24 years).

I'm using Xarray to represent it in memory, and the time dimension is named 'time'.

I want to group by 'time.hour' and then by 'time.dayofyear' (or vice versa) and remove both group means from the time series.
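
For reference, removing the mean of a single grouping is easy with Xarray's grouped arithmetic; a minimal sketch (assuming a DataArray named da with a 'time' dimension) looks like this, and what I'm after is the equivalent for both groupings together:

# single-group de-seasonalization: subtract each hour-of-day mean
# from the samples belonging to that hour
hourly_anom = da.groupby('time.hour') - da.groupby('time.hour').mean('time')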

In order to do that efficiently, I need to reorder the time series into a new xr.DataArray with the dimensions ['hour', 'dayofyear', 'rest'].

I wrote a function that works with Xarray's GroupBy objects and manages to do just that, although it takes a lot of memory...

I have a machine with 32GB of RAM and I still get a MemoryError from numpy.

I know the code works because I used it on an hourly resampled version of my original time series. So here's the code:

def time_series_stack(time_da, time_dim='time', grp1='hour', grp2='dayofyear'):
    """Takes a time-series xr.DataArray objects and reshapes it using
    grp1 and grp2. outout is a xr.Dataset that includes the reshaped DataArray
    , its datetime-series and the grps."""
    import xarray as xr
    import numpy as np
    import pandas as pd

    # try to infer the freq and put it into attrs for later reconstruction:
    freq = pd.infer_freq(time_da[time_dim].values)
    name = time_da.name
    time_da.attrs['freq'] = freq
    attrs = time_da.attrs

    # drop all NaNs:
    time_da = time_da.dropna(time_dim)

    # group grp1 and concat:
    grp_obj1 = time_da.groupby(time_dim + '.' + grp1)
    s_list = []
    for grp_name, grp_inds in grp_obj1.groups.items():
        da = time_da.isel({time_dim: grp_inds})
        s_list.append(da)
    grps1 = [x for x in grp_obj1.groups.keys()]
    stacked_da = xr.concat(s_list, dim=grp1)
    stacked_da[grp1] = grps1

    # group over the concatenated da and concat again:
    grp_obj2 = stacked_da.groupby(time_dim + '.' + grp2)
    s_list = []
    for grp_name, grp_inds in grp_obj2.groups.items():
        da = stacked_da.isel({time_dim: grp_inds})
        s_list.append(da)
    grps2 = [x for x in grp_obj2.groups.keys()]
    stacked_da = xr.concat(s_list, dim=grp2)
    stacked_da[grp2] = grps2

    # numpy part:
    # first, loop over both dims and drop NaNs, append values and datetimes:
    vals = []
    dts = []
    for i, grp1_val in enumerate(stacked_da[grp1]):
        da = stacked_da.sel({grp1: grp1_val})
        for j, grp2_val in enumerate(da[grp2]):
            val = da.sel({grp2: grp2_val}).dropna(time_dim)
            vals.append(val.values)
            dts.append(val[time_dim].values)

    # second, we get the max of the vals after the second groupby:
    max_size = max([len(x) for x in vals])

    # we fill NaNs and NaT for the remainder of them:
    concat_sizes = [max_size - len(x) for x in vals]
    concat_arrys = [np.empty((x)) * np.nan for x in concat_sizes]
    concat_vals = [np.concatenate(x) for x in list(zip(vals, concat_arrys))]
    # 1970-01-01 serves as the NaT placeholder for this time-series:
    concat_arrys = [np.zeros((x), dtype='datetime64[ns]')
                    for x in concat_sizes]
    concat_dts = [np.concatenate(x) for x in list(zip(dts, concat_arrys))]
    concat_vals = np.array(concat_vals)
    concat_dts = np.array(concat_dts)

    # finally, we reshape them:
    concat_vals = concat_vals.reshape((stacked_da[grp1].shape[0],
                                       stacked_da[grp2].shape[0],
                                       max_size))
    concat_dts = concat_dts.reshape((stacked_da[grp1].shape[0],
                                     stacked_da[grp2].shape[0],
                                     max_size))

    # create a Dataset and DataArrays for them:
    sda = xr.Dataset()
    sda.attrs = attrs
    sda[name] = xr.DataArray(concat_vals, dims=[grp1, grp2, 'rest'])
    sda[time_dim] = xr.DataArray(concat_dts, dims=[grp1, grp2, 'rest'])
    sda[grp1] = grps1
    sda[grp2] = grps2
    sda['rest'] = range(max_size)
    return sda

So for the 2,500,000-item time series, numpy throws the MemoryError, so I'm guessing memory is my bottleneck. What can I do to solve this?

Would Dask help me? And if so, how can I implement it?

Ziskin_Ziv

2 Answers


Like you, I ran it without issue when inputting a small time series (10,000 points long). However, when inputting a 100,000-long time series xr.DataArray, the grp_obj2 for loop ran away and used all the memory of the system.

This is what I used to generate the time series xr.DataArray:

import numpy as np
import xarray as xr

n = 10**5
times = np.datetime64('2000-01-01') + np.arange(n) * np.timedelta64(5,'m')
data = np.random.randn(n)
time_da = xr.DataArray(data, name='rand_data', dims=('time'), coords={'time': times})
# time_da.to_netcdf('rand_time_series.nc')

As you point out, Dask would be a way to solve it, but I can't see a clear path at the moment... Typically, the way to approach this kind of problem with Dask would be to:

  1. Make the input a dataset from a file (like NetCDF). This will not load the file into memory but will allow Dask to pull data from disk one chunk at a time.
  2. Define all calculations with dask.delayed or dask.futures methods for the entire body of code, up until writing the output. This is what allows Dask to read a small chunk of data, process it, then write it.
  3. Calculate one chunk of work and immediately write the output to a new dataset file. Effectively you end up streaming one chunk of input to one chunk of output at a time (but also threaded/parallelized); a rough sketch follows this list.
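
Roughly, those steps might look like the following sketch (assuming the series was written to 'rand_time_series.nc' as in the commented line above; the chunk size and output filename are arbitrary choices, and I haven't verified this scales to the full 2,500,000-point series):

import xarray as xr

# 1) open lazily so Dask pulls one chunk at a time from disk
ds = xr.open_dataset('rand_time_series.nc', chunks={'time': 100000})
da = ds['rand_data']

# 2) define the calculation lazily (nothing is computed yet);
#    here just the day-of-year means as a stand-in for the real work
doy_mean = da.groupby('time.dayofyear').mean('time')

# 3) writing the result triggers the chunked computation
doy_mean.to_netcdf('dayofyear_mean.nc')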

I tried importing Dask and breaking the input time_da xr.DataArray into chunks for Dask to work on, but it didn't help. From what I can tell, the line stacked_da = xr.concat(s_list, dim=grp1) forces Dask to make a full copy of stacked_da in memory, and much more... One workaround is to write stacked_da to disk and then immediately read it back in:

## For group 1
xr.concat(s_list, dim=grp1).to_netcdf('stacked_da1.nc')
stacked_da = xr.load_dataset('stacked_da1.nc')
stacked_da[grp1] = grps1

## For group 2
xr.concat(s_list, dim=grp2).to_netcdf('stacked_da2.nc')
stacked_da = xr.load_dataset('stacked_da2.nc')
stacked_da[grp2] = grps2

However, the file size for stacked_da1.nc is 19MB, while stacked_da2.nc gets huge at 6.5GB. This is for a time_da with 100,000 elements... so there's clearly something amiss...

Originally, it sounded like you want to subtract the group means from the time series data. The Xarray docs have an example of grouped arithmetic for exactly that: http://xarray.pydata.org/en/stable/groupby.html#grouped-arithmetic
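
Following that pattern, one way to subtract both group means without any reshaping is to apply the grouped arithmetic sequentially (a sketch using the synthetic time_da from above; whether sequential subtraction matches exactly what you need depends on your definition of "both their mean"):

# 1) remove the hourly mean
hourly_anom = time_da.groupby('time.hour') - time_da.groupby('time.hour').mean('time')
# 2) then remove the day-of-year mean from the result
anom = (hourly_anom.groupby('time.dayofyear')
        - hourly_anom.groupby('time.dayofyear').mean('time'))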

kilozulu
  • I did not mention it, but the reason I want the time series restructured is: 1) so I can see the seasonality vs. the groups I select. For example, if I choose a 2D pcolormesh plot of the time series (e.g., x: hour of the day, y: dayofyear), it's because I suspect the signal has daily and yearly seasonality. 2) Xarray can subtract the mean of one group easily, but no grouping of two or more coords exists as of yet (in the Xarray ecosystem - I'll be happy to be wrong about this). – Ziskin_Ziv Dec 15 '19 at 15:52
  • Now that I think of it, the solution is trivial, though boring. I could just divide the original time series into smaller chunks (I could empirically test for a chunk size that won't clog my memory), run the function on each, and afterwards concat the outputs. – Ziskin_Ziv Dec 15 '19 at 15:54
  • 1
    You may be able to achieve that without re-shaping. It may work to: 1) group by hour from original data and subtract the hourly mean and plot 2) then group by day from original data and subtract daily mean and plot I having a hard time picturing exactly what your final result should look like so I may have the order of steps or logic is off. – kilozulu Dec 15 '19 at 16:22
  • Can you please upvote my answer too? I spent a chunk of time diving into this... – kilozulu Dec 15 '19 at 16:28

The key is to group once, loop over the groups, then group again within each group and append the results to a list.

Next, I concat and use pd.MultiIndex.from_product for the groups.

No memory problems and no Dask needed, and it only takes a few seconds to run.

Here's the code, enjoy:

def time_series_stack(time_da, time_dim='time', grp1='hour', grp2='month',
                      plot=True):
    """Takes a time-series xr.DataArray objects and reshapes it using
    grp1 and grp2. output is a xr.Dataset that includes the reshaped DataArray
    , its datetime-series and the grps. plots the mean also"""
    import xarray as xr
    import pandas as pd
    # try to infer the freq and put it into attrs for later reconstruction:
    freq = pd.infer_freq(time_da[time_dim].values)
    name = time_da.name
    time_da.attrs['freq'] = freq
    attrs = time_da.attrs
    # drop all NaNs:
    time_da = time_da.dropna(time_dim)
    # first grouping:
    grp_obj1 = time_da.groupby(time_dim + '.' + grp1)
    da_list = []
    t_list = []
    for grp1_name, grp1_inds in grp_obj1.groups.items():
        da = time_da.isel({time_dim: grp1_inds})
        # second grouping:
        grp_obj2 = da.groupby(time_dim + '.' + grp2)
        for grp2_name, grp2_inds in grp_obj2.groups.items():
            da2 = da.isel({time_dim: grp2_inds})
            # extract datetimes and rewrite time coord to 'rest':
            times = da2[time_dim]
            times = times.rename({time_dim: 'rest'})
            times.coords['rest'] = range(len(times))
            t_list.append(times)
            da2 = da2.rename({time_dim: 'rest'})
            da2.coords['rest'] = range(len(da2))
            da_list.append(da2)
    # get group keys (grp_obj2 here is from the last grp1 iteration;
    # this assumes the grp2 keys are identical for every grp1 group):
    grps1 = [x for x in grp_obj1.groups.keys()]
    grps2 = [x for x in grp_obj2.groups.keys()]
    # concat and convert to dataset:
    stacked_ds = xr.concat(da_list, dim='all').to_dataset(name=name)
    stacked_ds[time_dim] = xr.concat(t_list, 'all')
    # create a multiindex for the groups:
    mindex = pd.MultiIndex.from_product([grps1, grps2], names=[grp1, grp2])
    stacked_ds.coords['all'] = mindex
    # unstack:
    ds = stacked_ds.unstack('all')
    ds.attrs = attrs
    return ds
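
A minimal usage sketch (assuming the synthetic time_da with variable name 'rand_data' from the other answer, and grp2='dayofyear' to match the question): removing both group means then amounts to averaging over 'rest' and letting Xarray broadcast the subtraction:

ds = time_series_stack(time_da, grp1='hour', grp2='dayofyear')
# one mean value per (hour, dayofyear) cell, broadcast back over 'rest':
anom = ds['rand_data'] - ds['rand_data'].mean('rest')
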
Ziskin_Ziv