0

I looked at similar answers but could not find one that suits my problem.

I have a daily covariance matrix that is constructed as a MultiIndex. It has "k" dates, and for each date "k" it has a matrix of size "n" by "n". The dimensions are technically (k, n, n) but as this is a MultiIndex pandas views this as shape (kxn, n).

As a minimum working example, I can provide the below:

dates = pd.date_range(start='20120101', end='20210101', freq='D')
X = pd.DataFrame( {'A' : np.random.rand(len(dates)), 'B' : np.random.rand(len(dates)), 'C' : np.random.rand(len(dates)), 'D' : np.random.rand(len(dates)), 'E' : np.random.rand(len(dates)) }, index=dates).ewm(halflife=30, min_periods=1).cov()

I would like to resample it from daily to minutely. Assuming my MultiIndex covariance matrix is called "X", I have managed to get the following to work:

X.unstack().resample("T").first().ffill().stack()

However, this takes an awfully long time to compute. Is there a faster and more efficient way to perform this operation?

This used to be quick with Panels that were deprecated after pandas 0.24. From my own profiling work the most memory intensive part seems to be the 'stack()'

The User
  • 55
  • 3
  • 11
  • Hello, could you please provide a code sample that creates a minimal example of the DataFrame in question? It needs to be executable code as there isn't a straightforward way to share multiindexed DataFrames as text. – Steele Farnsworth Aug 21 '21 at 20:02
  • In either case, it is likely that the solution involves `arr = X.to_numpy()` and reshaping/transposing the array as needed. – Steele Farnsworth Aug 21 '21 at 20:05
  • As an example: `dates = pd.date_range(start='20120101', end='20210101', freq='D')` where `X = pd.DataFrame( {'A' : np.random.rand(len(dates)), 'B' : np.random.rand(len(dates)), 'C' : np.random.rand(len(dates)), 'D' : np.random.rand(len(dates)), 'E' : np.random.rand(len(dates)) }, index=dates).ewm(halflife=30, min_periods=1).cov()` – The User Aug 21 '21 at 20:44
  • Thank you; please edit your question to contain this information. – Steele Farnsworth Aug 21 '21 at 20:48
  • 1
    There are 1440 minutes in a day, so a slowdown and increase in memory by that factor is expected. – BatWannaBe Aug 21 '21 at 20:55
  • Yes, this is true, although with my limited programming experience I am sure there is a faster way to do this than what I have done... – The User Aug 23 '21 at 08:26
  • I think part of the solution can involve taking only the upper or lower triangle of the matrix since it is symmetric this can be achieve with triu/tril with pandas – user14518362 Sep 04 '21 at 05:08
  • Do you *have* to work with a `kn × n` format? You could also just have a `k -> (n×n)` dataframe, with a date index and `np.array` or `np.matrix` entries in a single column (`dtype: object`) as `df = pd.DataFrame(data={'mat': [np.matrix(np.random.rand(n, n)) for i in range(len(dates))]}, index=dates)` (not a true CoV matrix but you get the idea) – Salmonstrikes Sep 04 '21 at 20:25
  • Perhaps you can split in several Dataframes, parallelize the transformation and concat the result ? – Benoit Drogou Sep 06 '21 at 14:31
  • 1
    @TheUser. I think your resample lost 24*60-1 records due to closed interval. – Corralien Sep 08 '21 at 10:03

1 Answers1

1

IIUC

Using resample to add information ('D' -> 'T') is not a right choice especially if you want to fill forward values. You can use np.vsplit to create a Panel-like then repeat your array according to your DatetimeIndex and finally reshape the data:

# Create new MultiIndex
dates2 = pd.date_range(X.index.levels[0].min(),
                       X.index.levels[0].max() + pd.DateOffset(days=1), 
                       freq='T', closed='left')
mi = pd.MultiIndex.from_product([dates2, X.index.levels[1]])

# Manipulate your array
vals = np.array(np.repeat(np.vsplit(X.values, len(X.index.levels[0])), 24*60, axis=0))
vals = vals.reshape(vals.shape[0]*vals.shape[1], vals.shape[2])

# New dataframe
out = pd.DataFrame(vals, index=mi, columns=X.columns)

For a smaller sample:

>>> df
               A   B
2012-01-01 A  11  12
           B  13  14
2012-01-02 A  21  22
           B  23  24
2012-01-03 A  31  32
           B  33  34

# Resample: 12H and 2 values per day
# dates2 = pd.date_range(df.index.levels[0].min(), df.index.levels[0].max() + pd.DateOffset(days=1), freq='12H', closed='left')
# mi = pd.MultiIndex.from_product([dates2, df.index.levels[1]])
# vals = np.array(np.repeat(np.vsplit(df.values, len(df.index.levels[0])), 2, axis=0))
# vals = vals.reshape(vals.shape[0]*vals.shape[1], vals.shape[2])
# out = pd.DataFrame(vals, index=mi, columns=df.columns)

>>> out
                        A   B
2012-01-01 00:00:00 A  11  12
                    B  13  14
2012-01-01 12:00:00 A  11  12
                    B  13  14
2012-01-02 00:00:00 A  21  22
                    B  23  24
2012-01-02 12:00:00 A  21  22
                    B  23  24
2012-01-03 00:00:00 A  31  32
                    B  33  34
2012-01-03 12:00:00 A  31  32
                    B  33  34

With your code:

>>> df.unstack().resample("12H").first().ffill().stack()
                          A     B
2012-01-01 00:00:00 A  11.0  12.0
                    B  13.0  14.0
2012-01-01 12:00:00 A  11.0  12.0
                    B  13.0  14.0
2012-01-02 00:00:00 A  21.0  22.0
                    B  23.0  24.0
2012-01-02 12:00:00 A  21.0  22.0
                    B  23.0  24.0
2012-01-03 00:00:00 A  31.0  32.0
                    B  33.0  34.0
                                   # <- Lost 2012-01-03 12:00:00

Performance on X

>>> %timeit op_resample()
9.1 s ± 568 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit new_array()
1.86 s ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Corralien
  • 109,409
  • 8
  • 28
  • 52